1 Overview

The  focus  of  this  project  was  learning  probabilistic  models  of  relational  data,  and  using  these 
models  to  interpret  new  relational  data.  Our  work  on  this  project  focused  on  several  areas: 

1.  Developing  undirected  probabilistic  models  for  representing  and  learning  graph  patterns. 

2.  Learning  patterns  involving  links  between  objects. 

3.  Learning  discriminative  models  for  classification  in  relational  data. 

4. Developing and labeling two real-world relational data sets (one involving web data and
the other a social network) and evaluating the performance of our methods on these data
sets.

5. Dealing with distributions that are non-uniform, in that different contexts (time periods,
organizations)  have  statistically  different  properties. 

The  text  below  elaborates  on  some  of  the  work  that  took  place  along  these  thrusts.  In  addition, 
more  information  about  the  work  done  under  this  project  can  be  found  in  the  publications  that 
document  the  work,  listed  below. 

2  Relational  Markov  Networks 
2.1  Basic  Language 

As  one  of  our  key  directions,  we  developed  a  new  class  of  probabilistic  models  for  relational  data 
based  on  partially  directed  and  undirected  graphical  models  (chain  graphs  and  Markov  networks). 
This class of models, which we called relational Markov networks (RMNs), is particularly well
suited  to  the  task  of  prediction  in  structured,  relational  data.  These  models  can  incorporate  rich 
information  about  the  attributes  of  entities  and,  more  importantly,  the  link  graph  between  entities, 
for  high-precision  classification.  Furthermore,  they  address  two  limitations  of  the  relational 
Bayesian  networks  that  we  proposed  in  our  early  work.  First,  undirected  models  do  not  impose 
the  acyclicity  constraint  that  hinders  representation  of  many  relational  dependencies  in  directed 
models.  For  example,  symmetric  relations  like  Met(X,  Y)  or  asymmetric  relations  like 
KnowsAbout(X,Y)  that  can  have  cycles  present  a  challenge  for  directed  models.  Second, 
undirected  models  are  well  suited  for  discriminative  training,  which  generally  improves 
classification  accuracy.  We  have  developed  a  system  that  can  learn  and  reason  with  such  models 
efficiently  on  large  databases. 

We  began  by  experimenting  with  a  publicly  available  dataset  of  several  web  sites  of  computer 
science departments at major universities (WebKB). The task consists of identifying the set of
students, faculty, courses and research projects at the department from the 1000 to 2000 web
pages  collected  by  a  web  crawl  of  each  site.  The  entities  roughly  correspond  to  web  pages  and 
links  to  hyperlinks  between  them.  Additionally,  the  web  pages  themselves  are  structured, 
consisting  of  different  sections.  This  problem  presents  several  interesting  challenges  typical  of 
natural language and relational domains: the text and interconnections of web pages are very
heterogeneous, as they are authored by many different people. However, there are strong relational
patterns that can be exploited to identify different entities. For example, pages of faculty
members tend to contain a list of students they advise, courses they teach, projects they manage,
etc. Students tend to link to their advisor, courses they took or assisted, projects they are involved
in, etc.

Using various models that incorporate relational patterns, we reduced the classification error
rate from 18% for a very strong approach that uses “flat” data, to 11.5% for our relational
approach. This is a relative reduction in error of over 35%. We note that the error reduction
relative to our earlier approach using directed probabilistic relational models is even greater. Our
experiments on this and other datasets involved reasoning in networks containing over 200
thousand entities connected by over 600 thousand links. The largest experiments take several
hours on a 700MHz machine with 2GB per process.
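
As a quick check of the figure quoted above, the relative error reduction follows directly from the
two error rates (18% for the flat approach and 11.5% for the relational approach):

```latex
\[
\frac{0.18 - 0.115}{0.18} \;=\; \frac{0.065}{0.18} \;\approx\; 0.361
\]
```

That is roughly 36%, consistent with the stated reduction of over 35%.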

2.2  Link  Prediction 

We  then  investigated  the  application  of  the  RMN  language  to  the  problem  of  predicting  the 
existence  and  the  type  of  relationships  between  two  entities.  For  example,  we  want  to  be  able  to 
assert not only that X is a professor and Y a student, but that X is Y's advisor. In a terrorist
domain,  we  might  want  to  conclude  not  only  that  X  is  a  bank  and  Y  a  terrorist  organization,  but 
that  the  relationship  between  them  is  that  of  money  laundering. 

We  developed  a  suite  of  subgraph  patterns  that  occur  in  real-world  relational  graphs  and 
showed that they can be encoded easily in our RMN language. Such patterns include, for
example, transitivity (if X knows Y and Y knows Z, then X is more likely to know Z), as well as
richer patterns (if X is Y's advisor and X teaches course Z, then Y is more likely to be the TA for
course Z). We also extended our learning and inference algorithms to work on the problem of link
classification. 

2.3  Dataset  Development  and  Experimentation 

To  test  this  algorithm,  we  constructed  a  rich  dataset  that  extends  the  WebKB  dataset  mentioned 
above.  The  new  dataset  incorporates  large  web  sites  of  four  new  schools  (Stanford,  Berkeley, 
MIT,  and  CMU)  and  uses  a  more  refined  ontology  of  entities  that  includes  students,  faculty,  staff, 
research  scientists,  courses,  research  projects,  research  groups,  etc.  In  addition,  links  between 
entities  are  labeled  as  well:  student-advisor,  course-instructor,  and  course-ta  (teaching  assistant), 
project-member,  group-member,  etc.  We  labeled  both  hyperlinks,  and  virtual  links,  where  one 
person’s  name  is  mentioned  on  the  webpage  of  another  person.  In  both  cases,  the  observed  link 
may  or  may  not  correspond  to  an  actual  relationship  (e.g.,  student-advisor).  We  built  several  tools 
that allow us to spider and label such data efficiently. Overall, we hand-labeled 11,000 webpages
and 110,000 links.

We  also  constructed  a  second  data  set  based  on  a  real-world  social  network  of  Stanford 
students.  For  each  student,  we  have  a  set  of  attributes,  such  as  their  hobbies,  residence,  major, 
etc.  We  also  have,  for  each  student,  the  other  students  they  consider  to  be  their  friends. 

On  the  university  data,  we  tested  different  models,  including  flat  classification  approaches, 
and  various  relational  approaches,  on  the  task  of  predicting  the  existence  and  type  of  link. 

Overall,  the  relational  models  performed  much  better  than  the  standard  “flat”  classification 
approaches,  increasing  the  average  classification  accuracy  from  55%  to  60%.  On  the  social 
network  data,  we  tried  a  somewhat  different  task,  where  some  random  subset  of  the  links  (10%, 
25%,  or  50%)  is  observed  in  the  test  data,  and  can  be  used  as  evidence  for  predicting  the 
remaining  links.  Using  just  the  observed  portion  of  links,  we  constructed  the  following  features: 
for  each  student,  the  proportion  of  students  in  the  residence  that  list  him/her  and  the  proportion  of 
students  he/she  lists;  for  each  pair  of  students,  the  proportion  of  other  students  they  have  as 
common  friends.  These  features  are  a  flat  version  of  relational  structure  and  dependencies 
between links, and therefore serve as a good benchmark for comparing against flat approaches.
We  then  compared  several  models:  a  “flat”  model,  which  uses  these  features  for  predicting  links, 
as  well  as  a  feature  for  each  match  of  a  characteristic  of  the  two  people  involved  (e.g.,  both 
people  are  computer  science  majors  or  both  are  freshmen).  Our  relational  model  introduces  a 
correlation  between  each  pair  of  links  emanating  from  each  person,  allowing  an  interaction 
between  their  existence.  Overall,  the  relational  model  outperformed  the  flat  model  by  a 
statistically  significant  margin  (as  measured  by  a  paired  t-test). 
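
The flat features described above can be sketched as follows. This is an illustrative
reconstruction, not the code used in the experiments: the function name, the input layout, the
choice to compute both per-student proportions within the student's residence, and the treatment
of friendship as undirected when counting common friends are all assumptions.

```python
from itertools import combinations

def flat_link_features(observed, residence):
    """observed: set of directed (a, b) pairs meaning "a lists b as a friend".
    residence: dict mapping each student to a residence name.
    Both input layouts are hypothetical, chosen for illustration."""
    students = set(residence)
    lists = {s: set() for s in students}      # students that s lists
    listed_by = {s: set() for s in students}  # students that list s
    for a, b in observed:
        lists[a].add(b)
        listed_by[b].add(a)

    # Per-student features: share of residence peers who list s, and whom s lists.
    per_student = {}
    for s in students:
        peers = {t for t in students if t != s and residence[t] == residence[s]}
        n = len(peers) or 1  # avoid division by zero for a lone resident
        per_student[s] = (len(listed_by[s] & peers) / n, len(lists[s] & peers) / n)

    # Per-pair feature: share of the other students who are common friends.
    per_pair = {}
    for a, b in combinations(sorted(students), 2):
        others = students - {a, b}
        friends_a = lists[a] | listed_by[a]
        friends_b = lists[b] | listed_by[b]
        common = friends_a & friends_b & others
        per_pair[(a, b)] = len(common) / (len(others) or 1)
    return per_student, per_pair
```

Only the observed portion of the links is passed in, so the features never look at the links that
are to be predicted.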

3  Collective  Classification 

As  a  parallel  thrust  to  our  development  of  probabilistic  relational  languages  and  associated 
learning  algorithms,  we  also  focused  on  the  fundamental  problem  of  learning  models  for 
collective  classification  —  classifying  an  entire  set  of  inter-dependent  entities  as  a  whole,  rather 
than  classifying  each  as  an  independent  instance.  Although  our  previous  learning  algorithms 
address  this  task  better  than  previous  approaches,  the  accuracies  they  achieved  were  still  lower 
than  we  would  like,  especially  for  domains  where  the  signal-to-noise  ratio  is  very  low  (i.e.,  the 
target  class  has  very  low  frequency  in  the  general  population).  We  have  therefore  developed  a 
new  approach  for  learning  the  parameters  of  our  undirected  relational  models.  Unlike  standard 
approaches,  which  try  to  optimize  the  conditional  likelihood  of  the  target  labels  given  the 
features,  our  approach  focuses  explicitly  on  the  classification  task.  In  particular,  motivated  by 
ideas  from  the  highly  successful  flat  classification  technique  of  “support  vector  machines 
(SVMs)”,  our  approach  tries  to  maximize  the  “margin”  -  the  difference  in  log-probability 
between  the  correct  label  and  all  other  labels. 

The  key  contribution  of  this  approach  is  that  it  can  apply  these  ideas  to  the  case  of  collective 
classification of an entire set of entities. In this case, the number of joint labelings of the set is
exponentially large: a set of n entities, each of which can take one of k labels, has k^n possible
joint labelings.

As  we  show,  we  can  exploit  the  structure  of  the  probabilistic  graphical  model  to  avoid  this 
exponential  blowup,  allowing  the  entire  learning  problem  to  be  formulated  as  a  compact  convex 
quadratic  program,  making  the  problem  amenable  to  a  variety  of  standard  methods.  Another 
major feature of our method is its ability to use “kernels”, a method that allows very
high-dimensional (even infinite-dimensional) feature spaces to be used efficiently. Kernels are one of
the factors that contribute to the enormous success of SVMs in many flat classification tasks.

So  far,  we  have  applied  this  method  to  the  task  of  collectively  labeling  a  set  of  entities  related 
only  by  a  simple  link  structure  in  the  form  of  a  sequence.  This  type  of  structure  is  of  independent 
interest,  as  it  can  be  used  to  collectively  classify  a  sequence  of  related  events.  We  have 
experimented  with  this  approach  on  the  problem  of  optical  character  recognition  -  labeling  a 
sequence of character images that form a word. Our approach shows a reduction in error
of 45% relative to state-of-the-art probabilistic models, and a reduction in error of 33%
relative to the best flat classifier, SVMs with kernels.
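
To illustrate why sequence structure avoids the exponential blowup, the sketch below finds the
best joint labeling of a chain in two ways: by enumerating all k^n labelings, and by dynamic
programming (the standard Viterbi recursion). It is not the max-margin learning procedure itself,
only the kind of structured computation such a method can rely on; the score tables `node` and
`edge` are hypothetical inputs.

```python
from itertools import product

def brute_force(node, edge):
    """Score all k**n labelings explicitly (exponential in n)."""
    n, k = len(node), len(node[0])

    def score(y):
        s = sum(node[i][y[i]] for i in range(n))
        return s + sum(edge[y[i]][y[i + 1]] for i in range(n - 1))

    return max(product(range(k), repeat=n), key=score)

def viterbi(node, edge):
    """Same argmax by dynamic programming in O(n * k**2) time."""
    n, k = len(node), len(node[0])
    best = list(node[0])  # best[b] = best score of a prefix ending in label b
    back = []             # back-pointers, one list per position after the first
    for i in range(1, n):
        ptr, nxt = [], []
        for b in range(k):
            a = max(range(k), key=lambda a: best[a] + edge[a][b])
            ptr.append(a)
            nxt.append(best[a] + edge[a][b] + node[i][b])
        back.append(ptr)
        best = nxt
    y = [max(range(k), key=lambda b: best[b])]
    for ptr in reversed(back):
        y.append(ptr[y[-1]])
    return tuple(reversed(y))
```

With n character images and k candidate letters, the first function touches k^n labelings while
the second does work proportional to n times k squared.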

4  Non-Stationary  Distributions 

In  this  quarter,  our  work  also  took  a  slightly  different  direction.  A  common  assumption  when 
performing  classification  tasks  is  that  both  the  training  and  the  operation  data  are  drawn  IID 
(Independent and Identically Distributed) from some fixed distribution. In other words, we
expect  regularities  found  in  the  training  data  to  show  up  in  operation  data,  and  vice  versa. 
However,  our  experience  with  the  university  website  classification  shows  that  this  is  often  not  the 
case: different organizations (universities) often exhibit very different patterns. In general,
training  and  operational  data  can  have  quite  different  distributions,  depending  on  factors  such  as 
when  the  data  was  collected,  or  where  it  was  collected,  or  by  whom  it  was  collected.  As  another 
example,  in  news  articles,  the  distribution  of  words  in  an  article  is  dependent  on  when  it  was 
written.  New  stories  emerge  over  time,  introducing  new  people  names,  new  place  names,  etc. 
Similarly,  in  identifying  terrorist  activity  from  communication  data  for  a  new  organization,  we 
expect  to  see  new  terms  that  we  have  rarely  or  never  encountered  before  in  organizations  used  in 
our training data. Ignoring the existence of such a phenomenon results in learning misleading
patterns.  Moreover,  these  new  terms  can  be  very  useful  for  classification.  For  example, 
discovering  the  code  name  for  a  new  terrorist  operation  can  help  to  identify  relevant 
communications. 

We introduced an approach that addresses this problem in two ways. First, we rely on terms that
we have seen before to infer the meanings of new features, and subsequently use these new
features for classification. For example, in examining communications data, we might find that a
certain new term is frequently mentioned together with the names of terrorists. We then
infer that this new term is also terrorist-related. Second, we learn characteristics
of features that make them useful for classification. For example, we might learn that names of
restaurants are more often keywords than names of cars, which helps us focus our search for
useful new keywords.
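
The first idea can be illustrated with a deliberately simple co-occurrence heuristic. This is a
sketch under stated assumptions, not the model from the publication listed below: the function
name, the representation of documents as sets of terms, and the averaging rule are all
hypothetical.

```python
from collections import defaultdict

def infer_new_term_weights(docs, known_weights):
    """docs: iterable of sets of terms from unlabeled (operational) documents.
    known_weights: dict term -> weight learned on training data
    (positive means indicative of the target class).
    Returns an estimated weight for every term not seen in training."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for doc in docs:
        known = [known_weights[t] for t in doc if t in known_weights]
        if not known:
            continue  # no familiar terms, so this document gives no evidence
        doc_score = sum(known) / len(known)
        for t in doc:
            if t not in known_weights:
                totals[t] += doc_score
                counts[t] += 1
    return {t: totals[t] / counts[t] for t in totals}
```

A new term that keeps appearing alongside terms with positive weights inherits a positive weight,
and can then be used as an additional feature at classification time.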

We  tested  our  approach  on  two  datasets:  a  news  article  collection  and  the  university  webpage 
collection described in previous reports. Compared to state-of-the-art approaches that do not take
into  consideration  information  from  new  features,  our  approach  showed  a  relative  reduction  in 
error  rate  of  56.3%  for  the  news  article  dataset,  and  a  relative  reduction  in  error  rate  of  20.7%  for 
the  university  webpage  dataset. 

5  Publications  and  Presentations 
5.1  Publications 

1.  “Max-Margin  Markov  Nets,”  B.  Taskar,  C.  Guestrin,  and  D.  Koller.  Neural  Information 
Processing  Systems  Conference  (NIPS),  Vancouver,  Canada,  December  2003.  Winner  of 
the  Best  Student  Paper  Award. 

2. “Link Prediction in Relational Data,” B. Taskar, M.-F. Wong, P. Abbeel, and D. Koller.
Neural  Information  Processing  Systems  Conference  (NIPS),  Vancouver,  Canada, 
December  2003.* 

3.  “Learning  on  the  Test  Data:  Leveraging  ‘Unseen’  Features,”  B.  Taskar,  M.-F.  Wong,  and 
D.  Koller.  Twentieth  International  Conference  on  Machine  Learning  (ICML),  Washington, 
D.C.,  August  2003. 

4.  “Learning  Associative  Markov  Networks,”  B.  Taskar,  V.  Chatalbashev,  and  D.  Koller, 
Proceedings  of  the  Twenty-First  International  Conference  on  Machine  Learning  (ICML), 
2004,  To  appear.* 

5. “Discriminative probabilistic models for relational data,” B. Taskar, P. Abbeel, and
D. Koller. Eighteenth Annual Conference on Uncertainty in Artificial Intelligence (UAI),
Edmonton, Canada, August 2002, pages 485-492.*

*  These  publications  are  contained  in  the  appendices  beginning  on  page  7. 

5.2 Presentations


The  above  papers  were  all  presented  by  the  PI  or  by  one  of  her  students  at  the  respective 
conferences where they appeared. In addition, the work performed under this contract
figured prominently in two plenary invited talks given by the PI:

1. Invited plenary talk at the 40th Anniversary Meeting of the Association for
Computational Linguistics (ACL ’02), Philadelphia, Pennsylvania, July 2002. Title:
“Probabilistic  Models  of  Relational  Data.” 

2. Joint invited plenary talk at the Twentieth International Conference on Machine Learning
(ICML-2003) and the Ninth ACM SIGKDD International Conference on Knowledge Discovery
and Data Mining (KDD-2003), Washington, DC, August 2003. Title:
“Probabilistic Models of Relational Data.”

6 Transitions


The  technology  developed  under  this  contract  was  transitioned  in  two  primary  ways.  First,  some 
of  the  ideas  and  algorithms  were  used  by  Alphatech  as  part  of  the  EAGLE  TIEs.  Second,  the 
collective  classification  technology  developed  as  part  of  this  project  played  and  is  still  playing  a 
key role in the work done in other projects. In particular, it was the technical basis for the year 1
deliverable for the CALO project, funded under the Perceptive Assistant that Learns (PAL)
program. We are currently applying the same method in a project funded under a learning seedling, to
identify and recognize objects in a 3D scene.

Appendix A:


Link  Prediction  in  Relational  Data 


Ben Taskar  Ming-Fai Wong  Pieter Abbeel  Daphne Koller

{btaskar, mingfai.wong, abbeel, koller}@cs.stanford.edu
Stanford  University 


Abstract 

Many  real-world  domains  are  relational  in  nature,  consisting  of  a  set  of  objects 
related  to  each  other  in  complex  ways.  This  paper  focuses  on  predicting  the 
existence  and  the  type  of  links  between  entities  in  such  domains.  We  apply  the 
relational Markov network framework of Taskar et al. to define a joint probabilistic
model over the entire link graph: entity attributes and links. The application
of  the  RMN  algorithm  to  this  task  requires  the  definition  of  probabilistic  patterns 
over  subgraph  structures.  We  apply  this  method  to  two  new  relational  datasets, 
one  involving  university  webpages,  and  the  other  a  social  network.  We  show  that 
the  collective  classification  approach  of  RMNs,  and  the  introduction  of  subgraph 
patterns  over  link  labels,  provide  significant  improvements  in  accuracy  over  flat 
classification,  which  attempts  to  predict  each  link  in  isolation. 


1  Introduction 

Many  real  world  domains  are  richly  structured,  involving  entities  of  multiple  types  that 
are  related  to  each  other  through  a  network  of  different  types  of  links.  Such  data  poses 
new  challenges  to  machine  learning.  One  challenge  arises  from  the  task  of  predicting 
which  entities  are  related  to  which  others  and  what  are  the  types  of  these  relationships.  For 
example,  in  a  data  set  consisting  of  a  set  of  hyperlinked  university  webpages,  we  might 
want  to  predict  not  just  which  page  belongs  to  a  professor  and  which  to  a  student,  but  also 
which  professor  is  which  student’s  advisor.  In  some  cases,  the  existence  of  a  relationship 
will  be  predicted  by  the  presence  of  a  hyperlink  between  the  pages,  and  we  will  have  only 
to  decide  whether  the  link  reflects  an  advisor-advisee  relationship.  In  other  cases,  we  might 
have  to  infer  the  very  existence  of  a  link  from  indirect  evidence,  such  as  a  large  number 
of  co-authored  papers.  In  a  very  different  application,  we  might  want  to  predict  links 
representing  participation  of  individuals  in  certain  terrorist  activities. 

One  possible  approach  to  this  task  is  to  consider  the  presence  and/or  type  of  the  link 
using  only  attributes  of  the  potentially  linked  entities  and  of  the  link  itself.  For  example, 
in  our  university  example,  we  might  try  to  predict  and  classify  the  link  using  the  words  on 
the two webpages, and the anchor words on the link (if present). This approach has the
advantage  that  it  reduces  to  a  simple  classification  task  and  we  can  apply  standard  machine 
learning  techniques.  However,  it  completely  ignores  a  rich  source  of  information  that  is 
unique  to  this  task  —  the  graph  structure  of  the  link  graph.  For  example,  a  strong  predictor 
of  an  advisor-advisee  link  between  a  professor  and  a  student  is  the  fact  that  they  jointly 
participate  in  several  projects.  In  general,  the  link  graph  typically  reflects  common  patterns 
of  interactions  between  the  entities  in  the  domain.  Taking  these  patterns  into  consideration 
should  allow  us  to  provide  a  much  better  prediction  for  links. 

In this paper, we tackle this problem using the relational Markov network (RMN) framework
of Taskar et al. [14]. We use this framework to define a single probabilistic model
over the entire link graph, including both object labels (when relevant) and links between
objects. The model parameters are trained discriminatively, to maximize the probability
of the (object and) link labels given the known attributes (e.g., the words on the page,
hyperlinks). The learned model is then applied, using probabilistic inference, to predict and
classify links using any observed attributes and links.

2  Link  Prediction 

A  relational  domain  is  described  by  a  relational  schema,  which  specifies  a  set  of  object 
types  and  attributes  for  them.  In  our  web  example,  we  have  a  Webpage  type,  where  each 
page  has  a  binary-valued  attribute  for  each  word  in  the  dictionary,  denoting  whether  the 
page  contains  the  word.  It  also  has  an  attribute  representing  the  “class”  of  the  webpage, 
e.g.,  a  professor’s  homepage,  a  student’s  homepage,  etc. 

To  address  the  link  prediction  problem,  we  need  to  make  links  first-class  citizens  in  our 
model.  Following  [5],  we  introduce  into  our  schema  object  types  that  correspond  to  links 
between entities. Each link object l is associated with a tuple of entity objects $(o_1, \ldots, o_k)$
that  participate  in  the  link.  For  example,  a  Hyperlink  link  object  would  be  associated  with 
a  pair  of  entities  —  the  linking  page,  and  the  linked-to  page,  which  are  part  of  the  link 
definition.  We  note  that  link  objects  may  also  have  other  attributes;  e.g.,  a  hyperlink  object 
might  have  attributes  for  the  anchor  words  on  the  link. 

As  our  goal  is  to  predict  link  existence,  we  must  consider  links  that  exist  and  links  that 
do  not.  We  therefore  consider  a  set  of  potential  links  between  entities.  Each  potential  link 
is  associated  with  a  tuple  of  entity  objects,  but  it  may  or  may  not  actually  exist.  We  denote 
this  event  using  a  binary  existence  attribute  Exists,  which  is  true  if  the  link  between  the 
associated  entities  exists  and  false  otherwise.  In  our  example,  our  model  may  contain  a 
potential link l for each pair of webpages, and the value of the variable l.Exists determines
whether  the  link  actually  exists  or  not.  The  link  prediction  task  now  reduces  to  the  problem 
of  predicting  the  existence  attributes  of  these  link  objects. 
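
One way to picture potential links as first-class objects is the small sketch below. The class
and function names are illustrative assumptions rather than part of the framework; `exists` plays
the role of the Exists attribute, with `None` standing for an unobserved value that is to be
predicted.

```python
from dataclasses import dataclass
from itertools import permutations
from typing import Optional, Tuple

@dataclass
class PotentialLink:
    entities: Tuple[str, str]      # (linking page, linked-to page)
    exists: Optional[bool] = None  # None = unobserved, to be predicted

def potential_links(pages, observed):
    """Create one potential link per ordered pair of distinct pages.
    observed: dict (src, dst) -> bool for links whose existence is known."""
    return [PotentialLink(pair, observed.get(pair))
            for pair in permutations(pages, 2)]
```

For n pages this creates n(n-1) potential links, which is why the set of candidate pairs usually
has to be restricted in practice.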

An instantiation I specifies the set of entities of each entity type and the values of all
attributes  for  all  of  the  entities.  For  example,  an  instantiation  of  the  hypertext  schema  is 
a  collection  of  webpages,  specifying  their  labels,  the  words  they  contain,  and  which  links 
between  them  exist.  A  partial  instantiation  specifies  the  set  of  objects,  and  values  for  some 
of  the  attributes.  In  the  link  prediction  task,  we  might  observe  all  of  the  attributes  for  all 
of  the  objects,  except  for  the  existence  attributes  for  the  links.  Our  goal  is  to  predict  these 
latter  attributes  given  the  rest. 

3  Relational  Markov  Networks 

We  begin  with  a  brief  review  of  the  framework  of  undirected  graphical  models  or  Markov 
Networks [13], and their extension to relational domains presented in [14].

Let  V  denote  a  set  of  discrete  random  variables  and  v  an  assignment  of  values  to  V. 
A  Markov  network  for  V  defines  a  joint  distribution  over  V.  It  consists  of  an  undirected 
dependency  graph,  and  a  set  of  parameters  associated  with  the  graph.  For  a  graph  G,  a 
clique c is a set of nodes $V_c$ in G, not necessarily maximal, such that each pair $V_i, V_j \in V_c$
is connected by an edge in G. Each clique c is associated with a clique potential $\phi_c(V_c)$,
which is a non-negative function defined on the joint domain of $V_c$. Letting C(G) be the
set of cliques, the Markov network defines the distribution
$P(v) = \frac{1}{Z} \prod_{c \in C(G)} \phi_c(v_c)$,
where Z is the standard normalizing partition function.
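
The definition above can be checked on a tiny network by brute-force enumeration. The sketch
below is only meant to make the roles of the clique potentials and of Z concrete; the function
name and input layout are assumptions, and enumeration is feasible only for a handful of
variables.

```python
from itertools import product
import math

def joint_distribution(domains, cliques):
    """domains: dict variable -> list of values.
    cliques: list of (variables, potential) pairs, where potential maps a
    tuple of values (in the order of variables) to a non-negative number.
    Returns (distribution over full assignments, partition function Z)."""
    names = list(domains)
    unnorm = {}
    for values in product(*(domains[v] for v in names)):
        v = dict(zip(names, values))
        # Product of clique potentials for this assignment.
        unnorm[values] = math.prod(
            phi(tuple(v[x] for x in vs)) for vs, phi in cliques)
    Z = sum(unnorm.values())  # partition function
    return {a: p / Z for a, p in unnorm.items()}, Z
```

For two binary variables with a single potential that favors agreement (value 2 when the labels
match, 1 otherwise), Z is 6 and each agreeing assignment has probability 1/3.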

A  relational  Markov  network  (RMN)  [14]  specifies  the  cliques  and  potentials  between 
attributes of related entities at a template level, so a single model provides a coherent
distribution for any collection of instances from the schema. RMNs specify the cliques using the
notion of a relational clique template, which specifies tuples of variables in the instantiation
using a relational query language. (See [14] for details.)

For example, if we want to define cliques between the class labels of linked pages,
we might define a clique template that applies to all pairs page1, page2 and link of types
Webpage, Webpage and Hyperlink, respectively, such that link points from page1 to
page2. We then define a potential template that will be used for all pairs of variables
page1.Category and page2.Category for such page1 and page2.

Given a particular instantiation I of the schema, the RMN M produces an unrolled
Markov network over the attributes of entities in I, in the obvious way. The cliques in the
unrolled network are determined by the clique templates C. We have one clique for each
$c \in C(I)$, and all of these cliques are associated with the same clique potential $\phi_c$.
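
Unrolling the hyperlink template from the example above might look as follows. This is a minimal
sketch: a real RMN would express the template in a relational query language, whereas here the
"query" is hard-coded, and the function names and the `log_phi` table are hypothetical.

```python
def unroll_link_template(pages, hyperlinks):
    """Return one clique per hyperlink between known pages.
    Each clique is the pair of variables (src.Category, dst.Category)."""
    return [(f"{src}.Category", f"{dst}.Category")
            for src, dst in hyperlinks
            if src in pages and dst in pages]

def log_score(cliques, labels, log_phi):
    """Unnormalized log-probability contributed by the unrolled cliques.
    The same table log_phi is reused for every clique, which is what
    tying the potential at the template level means."""
    return sum(log_phi[(labels[a], labels[b])] for a, b in cliques)
```

Because every unrolled clique shares one potential, the number of parameters stays fixed no
matter how many pages and links the instantiation contains.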

Taskar  et  al.  show  how  the  parameters  of  an  RMN  over  a  fixed  set  of  clique  templates 
can  be  learned  from  data.  In  this  case,  the  training  data  is  a  single  instantiation  I,  where 
the  same  parameters  are  used  multiple  times  —  once  for  each  different  entity  that  uses 
a feature. A choice of clique potential parameters w specifies a particular RMN, which
induces a probability distribution $P_w$ over the unrolled Markov network.

Gradient descent over w is used to optimize the conditional likelihood of the target
variables given the observed variables in the training set. The gradient involves a term which
is the posterior probability of the target variables given the observed, whose computation
requires that we run probabilistic inference over the entire unrolled Markov network. In
relational domains, this network is typically large and densely connected, making exact
inference intractable. Taskar et al. therefore propose the use of belief propagation [13, 17].
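
For reference, the gradient described above takes a standard form when each clique potential is
log-linear in the parameters, $\phi_c(v_c) = \exp(w_c \cdot f_c(v_c))$. That parameterization is
an assumption here, since this excerpt does not spell it out; $x$ denotes the observed variables,
$y$ the target variables, and $f$ the vector of clique features summed over the unrolled network.

```latex
\[
\nabla_w \log P_w(y \mid x) \;=\; f(x, y) \;-\; \mathbb{E}_{P_w(Y \mid x)}\bigl[ f(x, Y) \bigr]
\]
```

The expectation is the term that requires inference: it is a sum of clique marginals under the
current parameters, which is where approximate inference such as belief propagation comes in.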

4  Subgraph  Templates  in  a  Link  Graph 

The  structure  of  link  graphs  has  been  widely  used  to  infer  importance  of  documents  in 
scientific  publications  [4]  and  hypertext  (PageRank  [12],  Hubs  and  Authorities  [8]).  Social 
networks  have  been  extensively  analyzed  in  their  own  right  in  order  to  quantify  trends  in 
social  interactions  [16].  Link  graph  structure  has  also  been  used  to  improve  document 
classification [7, 6, 15].

In our experiments, we found that the combination of a relational language with a
probabilistic graphical model provides a very flexible framework for modeling complex patterns
common in relational graphs. First, as observed by Getoor et al. [5], there are often
correlations between the attributes of entities and the relations in which they participate. For
example,  in  a  social  network,  people  with  the  same  hobby  are  more  likely  to  be  friends. 

We  can  also  exploit  correlations  between  the  labels  of  entities  and  the  relation  type.  For 
example,  only  students  can  be  teaching  assistants  in  a  course.  We  can  easily  capture  such 
correlations  by  introducing  cliques  that  involve  these  attributes.  Importantly,  these  cliques 
are  informative  even  when  attributes  are  not  observed  in  the  test  data.  For  example,  if  we 
have  evidence  indicating  an  advisor-advisee  relationship,  our  probability  that  X  is  a  faculty 
member  increases,  and  thereby  our  belief  that  X  participates  in  a  teaching  assistant  link 
with  some  entity  Z  decreases. 

We  also  found  it  useful  to  consider  richer  subgraph  templates  over  the  link  graph.  One 
useful  type  of  template  is  a  similarity  template,  where  objects  that  share  a  certain  graph- 
based  property  are  more  likely  to  have  the  same  label.  Consider,  for  example,  a  professor 
X  and  two  other  entities  Y  and  Z.  If  X’s  webpage  mentions  Y  and  Z  in  the  same  context,  it 
is likely that the X-Y relation and the X-Z relation are of the same type; for example, if Y
is Professor X’s advisee, then probably so is Z. Our framework accommodates these patterns
easily,  by  introducing  pairwise  cliques  between  the  appropriate  relation  variables. 

Another useful type of subgraph template involves transitivity patterns, where the
presence of an A-B link and of a B-C link increases (or decreases) the likelihood of an A-C link.
For  example,  students  often  assist  in  courses  taught  by  their  advisor.  Note  that  this  type 
of  interaction  cannot  be  accounted  for  just  using  pairwise  cliques.  By  introducing  cliques 
over  triples  of  relations,  we  can  capture  such  patterns  as  well.  We  can  incorporate  even 
more  complicated  patterns,  but  of  course  we  are  limited  by  the  ability  of  belief  propagation 
to  scale  up  as  we  introduce  larger  cliques  and  tighter  loops  in  the  Markov  network. 
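
The transitivity template can be pictured as one clique over three link variables for every
triple of entities. The sketch below treats links as undirected and enumerates all triples; both
choices, as well as the function name, are simplifying assumptions made for illustration.

```python
from itertools import combinations

def transitivity_cliques(entities):
    """Return one clique per unordered triple {A, B, C}.
    Each clique lists the three link variables (A-B, B-C, A-C),
    with every link variable named by its sorted pair of endpoints."""
    def link(a, b):
        return tuple(sorted((a, b)))

    return [(link(a, b), link(b, c), link(a, c))
            for a, b, c in combinations(sorted(entities), 3)]
```

The number of such cliques grows as n choose 3, and each link variable takes part in many
triangles, which gives a sense of why larger cliques and tighter loops strain belief propagation.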

Figure 1: (a) Relation prediction with entity labels given. Relational models on average performed
better than the baseline Flat model. (b) Entity label prediction. The relational model Neigh performed
significantly better. (c) Relation prediction without entity labels. Relational models performed better
most of the time, although on some schools some models performed worse.

We note that our ability to model these more complex graph patterns relies on our use
of an undirected Markov network as our probabilistic model. In contrast, the approach of
Getoor et al. uses directed graphical models (Bayesian networks and PRMs [9]) to represent
a probabilistic model of both relations and attributes. Their approach easily captures
the dependence of link existence on attributes of entities. But the constraint that the
probabilistic dependency graph be a directed acyclic graph makes it hard to see how we would
represent the subgraph patterns described above. For example, for the transitivity pattern,
we might consider simply directing the correlation edges between link existence variables
arbitrarily. However, it is not clear how we would then parameterize a link existence
variable for a link that is involved in multiple triangles. See [15] for further discussion.

5  Experiments  on  Web  Data 

We collected and manually labeled a new relational dataset inspired by WebKB [2]. Our
dataset consists of Computer Science department webpages from 3 schools: Stanford,
Berkeley, and MIT. A total of 2954 pages are labeled into one of eight categories: faculty,
student,  research  scientist,  staff,  research  group,  research  project,  course  and  organization 
(organization  refers  to  any  large  entity  that  is  not  a  research  group).  Owned  pages,  which 
are  owned  by  an  entity  but  are  not  the  main  page  for  that  entity,  were  manually  assigned  to 
that  entity.  The  average  distribution  of  classes  across  schools  is:  organization  (9%),  student 
(40%),  research  group  (8%),  faculty  (11%),  course  (16%),  research  project  (7%),  research 
scientist  (5%),  and  staff  (3%). 

We established a set of candidate links between entities based on evidence of a relation
between them. One type of evidence for a relation is a hyperlink from an entity page or one
of its owned pages to the page of another entity. A second type of evidence is a virtual
link: we assigned a number of aliases to each page using the page title, the anchor text of
incoming links, and email addresses of the entity involved. Mentioning an alias of a page
on another page constitutes a virtual link. The resulting set of 7161 candidate links was
labeled as corresponding to one of five relation types: Advisor (faculty, student),
Member (research group/project, student/faculty/research scientist), Teach (faculty/research
scientist/staff, course), TA (student, course), and Part-Of (research group, research project),
or as “none”, denoting that the link does not correspond to any of these relations.

The observed attributes for each page are the words on the page itself and the “meta-words”
on the page: the words in the title, section headings, and anchors to the page from
other pages. For links, the observed attributes are the anchor text, text just before the link
(hyperlink  or  virtual  link),  and  the  heading  of  the  section  in  which  the  link  appears. 

Our  task  is  to  predict  the  relation  type,  if  any,  for  all  the  candidate  links.  We  tried  two 
settings  for  our  experiments:  with  page  categories  observed  (in  the  test  data)  and  page 
categories unobserved. For all our experiments, we trained on two schools and tested on
the remaining school.

Observed Entity Labels. We first present results for the setting with observed page
categories. Given the page labels, we can rule out many impossible relations; the resulting
label  breakdown  among  the  candidate  links  is:  none  (38%),  member  (34%),  part-of  (4%), 
advisor  (11%),  teach  (9%),  TA  (5%). 

There  is  a  huge  range  of  possible  models  that  one  can  apply  to  this  task.  We  selected  a 
set  of  models  that  we  felt  represented  some  range  of  patterns  that  manifested  in  the  data. 

Link-Flat is our baseline model, predicting links one at a time using multinomial
logistic regression. This is a strong classifier, and its performance is competitive with other
classifiers  (e.g.,  support  vector  machines).  The  features  used  by  this  model  are  the  labels  of 
the  two  linked  pages  and  the  words  on  the  links  going  from  one  page  and  its  owned  pages 
to  the  other  page.  The  number  of  features  is  around  1000. 

The relational models try to improve upon the baseline model by modeling the interactions
between relations and predicting relations jointly. The Section model introduces
cliques  over  relations  whose  links  appear  consecutively  in  a  section  on  a  page.  This 
model  tries  to  capture  the  pattern  that  similarly  related  entities  (e.g.,  advisees,  members 
of  projects)  are  often  listed  together  on  a  webpage.  This  pattern  is  a  type  of  similarity 
template,  as  described  in  Section  4.  The  Triad  model  is  a  type  of  transitivity  template,  as 
discussed  in  Section  4.  Specifically,  we  introduce  cliques  over  sets  of  three  candidate  links 
that  form  a  triangle  in  the  link  graph.  The  Section  +  Triad  model  includes  the  cliques  of 
the  two  models  above. 
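The triangle construction behind the Triad model can be sketched in a few lines. This is only an illustrative sketch, not the authors' implementation: the function name `find_triads`, the treatment of links as undirected pairs, and the toy entity names are all assumptions made for the example.

```python
def find_triads(candidate_links):
    """Return all sets of three candidate links that form a triangle.

    candidate_links: iterable of (entity_a, entity_b) pairs. Links are
    treated as undirected here, which is an assumption of this sketch.
    """
    # Normalize each link so that (a, b) and (b, a) coincide; drop self-links.
    links = {tuple(sorted(link)) for link in candidate_links if link[0] != link[1]}

    # Adjacency sets over the link graph.
    neighbors = {}
    for a, b in links:
        neighbors.setdefault(a, set()).add(b)
        neighbors.setdefault(b, set()).add(a)

    triads = set()
    for a, b in links:
        # Any entity linked to both endpoints closes a triangle a-b-c.
        for c in neighbors[a] & neighbors[b]:
            x, y, z = sorted((a, b, c))
            triads.add(((x, y), (x, z), (y, z)))
    return sorted(triads)


# Hypothetical toy link graph: one triangle plus one dangling link.
links = [("student", "advisor"), ("advisor", "course"),
         ("student", "course"), ("student", "group")]
print(find_triads(links))
```

Each returned triple of links would receive one 3-clique in the network; the number of such cliques grows quickly in dense link graphs, which is consistent with the inference difficulties discussed elsewhere in the text.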

As  shown  in  Fig.  1(a),  both  the  Section  and  Triad  models  outperform  the  flat  model,  and 
the  combined  model  has  an  average  accuracy  gain  of  2.26%,  or  10.5%  relative  reduction  in 
error.  As  we  only  have  three  runs  (one  for  each  school),  we  cannot  meaningfully  analyze 
the  statistical  significance  of  this  improvement. 
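As a rough consistency check on the two reported figures, one can back out the error rates they imply. The derivation below assumes that the relative reduction is measured against the flat model's average error and that the accuracy gain is in percentage points; the resulting error rates are implied values, not numbers reported in the text.

```latex
\begin{aligned}
r &= \frac{\Delta}{e_{\text{flat}}}
  \;\Longrightarrow\;
  e_{\text{flat}} = \frac{\Delta}{r} = \frac{2.26}{0.105} \approx 21.5\%, \\
e_{\text{combined}} &= e_{\text{flat}} - \Delta \approx 21.5\% - 2.26\% \approx 19.3\%.
\end{aligned}
```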

As an example of the interesting inferences made by the models, we found a
student-professor pair that was misclassified by the Flat model as none (there is only a single
hyperlink from the student’s page to the advisor’s) but correctly identified by both the
Section and Triad models. The Section model utilizes a paragraph on the student’s webpage
describing  his  research,  with  a  section  of  links  to  his  research  groups  and  the  link  to  his 
advisor.  Examining  the  parameters  of  the  Section  model  clique,  we  found  that  the  model 
learned  that  it  is  likely  for  people  to  mention  their  research  groups  and  advisors  in  the  same 
section.  By  capturing  this  trend,  the  Section  model  is  able  to  increase  the  confidence  of  the 
student-advisor  relation.  The  Triad  model  corrects  the  same  misclassification  in  a  different 
way.  Using  the  same  example,  the  Triad  model  makes  use  of  the  information  that  both  the 
student  and  the  teacher  belong  to  the  same  research  group,  and  the  student  TAed  a  class 
taught  by  his  advisor.  It  is  important  to  note  that  none  of  the  other  relations  are  observed  in 
the  test  data,  but  rather  the  model  bootstraps  its  inferences. 

Unobserved Entity Labels. When the labels of pages are not known during relation
prediction, we cannot rule out possible relations for candidate links based on the labels of
participating  entities.  Thus,  we  have  many  more  candidate  links  that  do  not  correspond  to 
any  of  our  relation  types  (e.g.,  links  between  an  organization  and  a  student).  This  makes  the 
existence  of  relations  a  very  low  probability  event,  with  the  following  breakdown  among 
the  potential  relations:  none  (71%),  member  (16%),  part-of  (2%),  advisor  (5%),  teach  (4%), 
TA  (2%).  In  addition,  when  we  construct  a  Markov  network  in  which  page  labels  are  not 
observed,  the  network  is  much  larger  and  denser,  making  the  (approximate)  inference  task 
much  harder.  Thus,  in  addition  to  models  that  try  to  predict  page  entity  and  relation  labels 
simultaneously,  we  also  tried  a  two-phase  approach,  where  we  first  predict  page  categories, 
and  then  use  the  predicted  labels  as  features  for  the  model  that  predicts  relations. 

Figure 2: (a) Average precision/recall breakeven point for 10%, 25%, 50% observed links. (b)
Average precision/recall breakeven point for each fold of school residences at 25% observed links.

For predicting page categories, we compared two models. The Entity-Flat model is
multinomial logistic regression that uses words and “meta-words” from the page and its owned
pages in separate “bags” of words. The number of features is roughly 10,000. The
Neighbors model is a relational model that exploits another type of similarity template: pages
with similar URLs often belong to the same category or tightly linked categories (research
group/project, professor/course). For each page, two pages with URLs closest in edit
distance are selected as “neighbors”, and we introduced pairwise cliques between
“neighboring” pages. Fig. 1(b) shows that the Neighbors model clearly outperforms the Flat model
across all schools, by an average of 4.9% accuracy gain.
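The neighbor selection used by the Neighbors model can be illustrated as follows. This is a minimal sketch under stated assumptions: the text says only "edit distance", so Levenshtein distance is assumed here, ties are broken alphabetically for determinism, and the helper names and example URLs are hypothetical.

```python
def edit_distance(s, t):
    """Levenshtein distance between strings s and t (row-by-row dynamic programming)."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, start=1):
        curr = [i]
        for j, ct in enumerate(t, start=1):
            cost = 0 if cs == ct else 1
            curr.append(min(prev[j] + 1,          # delete cs
                            curr[j - 1] + 1,      # insert ct
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def url_neighbors(urls, k=2):
    """Map each URL to the k other URLs closest to it in edit distance."""
    result = {}
    for u in urls:
        others = [v for v in urls if v != u]
        # Sort by distance; break ties alphabetically (an arbitrary choice).
        others.sort(key=lambda v: (edit_distance(u, v), v))
        result[u] = others[:k]
    return result


# Hypothetical URLs, for illustration only.
pages = ["cs.example.edu/~ann/", "cs.example.edu/~ann/cs101/",
         "cs.example.edu/~bob/", "cs.example.edu/groups/ai/"]
for page, nbrs in url_neighbors(pages).items():
    print(page, "->", nbrs)
```

Each (page, neighbor) pair would then be connected by a pairwise clique over the two page labels. The all-pairs comparison is quadratic in the number of pages, which is acceptable at the scale of a few thousand pages.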

Given the page categories, we can now apply the different models for link classification.
Thus, the Phased (Flat/Flat) model uses the Entity-Flat model to classify the page
labels, and then the Link-Flat model to classify the candidate links using the resulting
entity labels. The Phased (Neighbors/Flat) model uses the Neighbors model to classify
the entity labels, and then the Link-Flat model to classify the links. The Phased
(Neighbors/Section) model uses the Neighbors model to classify the entity labels and then the
Section model to classify the links.

We  also  tried  two  models  that  predict  page  and  relation  labels  simultaneously.  The 
Joint  +  Neighbors  model  is  simply  the  union  of  the  Neighbors  model  for  page  categories 
and  the  Flat  model  for  relation  labels  given  the  page  categories.  The  Joint  +  Neighbors 
+  Section  model  additionally  introduces  the  cliques  that  appeared  in  the  Section  model 
between  links  that  appear  consecutively  in  a  section  on  a  page.  We  train  the  joint  models 
to  predict  both  page  and  relation  labels  simultaneously. 

As the proportion of the “none” relation is so large, we use the probability of “none” to
define a precision-recall curve. If this probability is less than some threshold, we predict
the most likely label (other than none), otherwise we predict the most likely label
(including none). As usual, we report results at the precision-recall breakeven point on the test
data. Fig. 1(c) shows the breakeven points achieved by the different models on the three
schools. Relational models, both phased and joint, did better than flat models on
average. However, performance varies from school to school, and for both joint and phased
models, performance on one of the schools is worse than that of the flat model.
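The thresholding rule and the breakeven search can be sketched as below. Several details are assumptions of this sketch rather than statements from the text: precision and recall are computed over non-"none" links only, the breakeven point is approximated by scanning a fixed grid of thresholds, and all function names are hypothetical.

```python
def predict(probs, threshold):
    """probs: dict mapping each label (including 'none') to its probability."""
    if probs["none"] < threshold:
        # Force a relation: most likely label other than 'none'.
        candidates = {k: v for k, v in probs.items() if k != "none"}
    else:
        # Otherwise take the most likely label, 'none' included.
        candidates = probs
    return max(candidates, key=candidates.get)


def precision_recall(all_probs, true_labels, threshold):
    """Precision and recall over predicted / actual non-'none' relations."""
    preds = [predict(p, threshold) for p in all_probs]
    predicted = sum(1 for p in preds if p != "none")
    actual = sum(1 for t in true_labels if t != "none")
    correct = sum(1 for p, t in zip(preds, true_labels) if p != "none" and p == t)
    precision = correct / predicted if predicted else 1.0
    recall = correct / actual if actual else 1.0
    return precision, recall


def breakeven(all_probs, true_labels, steps=100):
    """Scan thresholds in [0, 1]; return (threshold, P, R) minimizing |P - R|."""
    best = None
    for s in range(steps + 1):
        th = s / steps
        p, r = precision_recall(all_probs, true_labels, th)
        if best is None or abs(p - r) < abs(best[1] - best[2]):
            best = (th, p, r)
    return best
```

Raising the threshold forces more links to receive a relation label, which tends to increase recall at the expense of precision; the breakeven point is where the two curves cross.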


6  Experiments  on  Social  Network  Data 

The  second  dataset  we  used  has  been  collected  by  a  portal  website  at  a  large  university  that 
hosts  an  online  community  for  students  [1].  Among  other  services,  it  allows  students  to 
enter information about themselves, create lists of their friends and browse the social
network. Personal information includes residence, gender, major and year, as well as favorite
sports, music, books, social activities, etc. We focused on the task of predicting the
“friendship” links between students from their personal information and a subset of their links. We
selected  students  living  in  sixteen  different  residences  or  dorms  and  restricted  the  data  to 
the  friendship  links  only  within  each  residence,  eliminating  inter-residence  links  from  the 
data  to  generate  independent  training/test  splits.  Each  residence  has  about  15-25  students 
and  an  average  student  lists  about  25%  of  his  or  her  house-mates  as  friends. 

We used an eight-fold train-test split, where we trained on fourteen residences and tested
on two. Predicting links between two students from personal information alone is a
very difficult task, so we tried a more realistic setting, where some proportion of the links
is observed in the test data, and can be used as evidence for predicting the remaining links.
We used the following proportions of observed links in the test data: 10%, 25%, and 50%.
The observed links were selected at random, and the results we report are averaged over
five trials of this random selection.

Using  just  the  observed  portion  of  links,  we  constructed  the  following  flat  features:  for 
each  student,  the  proportion  of  students  in  the  residence  that  list  him/her  and  the  proportion 
of  students  he/she  lists;  for  each  pair  of  students,  the  proportion  of  other  students  they  have 
as  common  friends.  The  values  of  the  proportions  were  discretized  into  four  bins.  These 
features  capture  some  of  the  relational  structure  and  dependencies  between  links:  Students 
who  list  (or  are  listed  by)  many  friends  in  the  observed  portion  of  the  links  tend  to  have  links 
in  the  unobserved  portion  as  well.  More  importantly,  having  friends  in  common  increases 
the  likelihood  of  a  link  between  a  pair  of  students. 
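The flat features described above might be computed along the following lines. The text specifies four bins but not how they are chosen, so equal-width bins are an assumption here, as are the choice of denominators, the definition of a "common friend" as someone linked to both students in either direction, and every name in the code.

```python
def flat_features(students, observed, num_bins=4):
    """Compute binned link features from the observed links of one residence.

    students: list of student ids (at least three).
    observed: set of directed pairs (a, b) meaning 'a lists b as a friend'.
    Returns (node, pair): node[s] = (listed-by bin, lists bin),
    pair[(s, t)] = common-friends bin for each unordered pair.
    """
    n_others = len(students) - 1

    def to_bin(p):
        # Equal-width bins over [0, 1]; p == 1.0 falls in the top bin.
        return min(int(p * num_bins), num_bins - 1)

    lists = {s: {b for (a, b) in observed if a == s} for s in students}
    listed_by = {s: {a for (a, b) in observed if b == s} for s in students}

    node = {s: (to_bin(len(listed_by[s]) / n_others),
                to_bin(len(lists[s]) / n_others))
            for s in students}

    pair = {}
    for i, s in enumerate(students):
        for t in students[i + 1:]:
            friends_s = lists[s] | listed_by[s]
            friends_t = lists[t] | listed_by[t]
            common = (friends_s & friends_t) - {s, t}
            pair[(s, t)] = to_bin(len(common) / (len(students) - 2))
    return node, pair


# Hypothetical residence with five students and four observed links.
students = ["a", "b", "c", "d", "e"]
observed = {("a", "c"), ("b", "c"), ("c", "a"), ("d", "a")}
node, pair = flat_features(students, observed)
print(node["a"], pair[("a", "b")])
```

Because these features are computed once from the observed links, they can be fed to an ordinary logistic regression; they summarize relational structure without requiring joint inference over the unobserved links.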

The  Flat  model  uses  logistic  regression  with  the  above  features  as  well  as  personal 
information  about  each  user.  In  addition  to  individual  characteristics  of  the  two  people,  we 
also  introduced  a  feature  for  each  match  of  a  characteristic,  for  example,  both  people  are 
computer  science  majors  or  both  are  freshmen. 

The Compatibility model uses a type of similarity template, introducing cliques
between each pair of links emanating from each person. Similarly to the Flat model, these
cliques include a feature for each match of the characteristics of the two potential friends.
This model captures the tendency of a person to have friends who share many
characteristics (even though the person might not possess them). For example, a student may be
friends with several CS majors, even though he is not a CS major himself. We also tried
models  that  used  transitivity  templates,  but  the  approximate  inference  with  3-cliques  often 
failed  to  converge  or  produced  erratic  results. 

Fig. 2(a) compares the average precision/recall breakeven point achieved by the different
models at the three different settings of observed links. Fig. 2(b) shows the performance
on each of the eight folds containing two residences each. Using a paired t-test, the
Compatibility model outperforms Flat with p-values 0.0036, 0.00064 and 0.054 for the three
settings, respectively.

7  Discussion  and  Conclusions 

In  this  paper,  we  consider  the  problem  of  link  prediction  in  relational  domains.  We  focus 
on  the  task  of  collective  link  classification,  where  we  are  simultaneously  trying  to  predict 
and  classify  an  entire  set  of  links  in  a  link  graph.  We  show  that  the  use  of  a  probabilistic 
model  over  link  graphs  allows  us  to  represent  and  exploit  interesting  subgraph  patterns  in 
the  link  graph.  Specifically,  we  have  found  two  types  of  patterns  that  seem  to  be  beneficial 
in  several  places.  Similarity  templates  relate  the  classification  of  links  or  objects  that  share 
a  certain  graph-based  property  (e.g.,  links  that  share  a  common  endpoint).  Transitivity 
templates relate triples of objects and links organized in a triangle. We show that the use of
these patterns significantly improves the classification accuracy over flat models.

Relational  Markov  networks  are  not  the  only  method  one  might  consider  applying  to  the 
link  prediction  and  classification  task.  We  could,  for  example,  build  a  link  predictor  that 
considers  other  links  in  the  graph  by  converting  graph  features  into  flat  features  [11],  as 
we did in the social network data. As our experiments show, even with these features, the
collective prediction approach works better. Another approach is to use relational classifiers
such  as  variants  of  inductive  logic  programming  [10].  Generally,  however,  these  methods 
have  been  applied  to  the  problem  of  predicting  or  classifying  a  single  link  at  a  time.  It  is 
not  clear  how  well  they  would  extend  to  the  task  of  simultaneously  predicting  an  entire  link 
graph.  Finally,  we  could  apply  the  directed  PRM  framework  of  [5].  However,  as  shown 
in  [15],  the  discriminatively  trained  RMNs  perform  significantly  better  than  generatively 
trained  PRMs  even  on  the  simpler  entity  classification  task.  Furthermore,  as  we  discussed, 
the  PRM  framework  cannot  represent  (in  any  natural  way)  the  type  of  subgraph  patterns 
that seem prevalent in link graph data. Therefore, the RMN framework seems much more
appropriate for this task.

Although  the  RMN  framework  worked  fairly  well  on  this  task,  there  is  significant  room 
for improvement. One of the key problems limiting the applicability of our approach is the
reliance  on  belief  propagation,  which  often  does  not  converge  in  more  complex  problems. 
This  problem  is  especially  acute  in  the  link  prediction  problem,  where  the  presence  of  all 
potential  links  leads  to  densely  connected  Markov  networks  with  many  short  loops.  This 
problem  can  be  addressed  with  heuristics  that  focus  the  search  on  links  that  are  plausible 
(as  we  did  in  a  very  simple  way  in  the  webpage  experiments).  A  more  interesting  solution 
would  be  to  develop  a  more  integrated  approximate  inference  /  learning  algorithm. 

Our  results  use  a  set  of  relational  patterns  that  we  have  discovered  to  be  useful  in  the 
domains  that  we  have  considered.  However,  many  other  rich  and  interesting  patterns  are 
possible.  Thus,  in  the  relational  setting,  even  more  so  than  in  simpler  tasks,  the  issue  of 
feature  construction  is  critical.  It  is  therefore  important  to  explore  the  problem  of  automatic 
feature induction, as in [3].

Finally, we believe that the problem of modeling link graphs has numerous other
applications, including: analyzing communities of people and hierarchical structure of
organizations, identifying people or objects that play certain key roles, predicting current and
future interactions, and more.
Abstract

Markov networks are extensively used to model complex sequential, spatial, and relational
interactions in fields as diverse as image processing, natural language analysis, and
bioinformatics. However, inference and learning in general Markov networks are intractable.
In this paper, we focus on learning a large subclass of such models (called associative
Markov networks) that are tractable or closely approximable. This subclass contains
networks of discrete variables with K labels each and clique potentials that favor the same
labels for all variables in the clique. Such networks capture the “guilt by association”
pattern of reasoning present in many domains, in which connected (“associated”) variables
tend to have the same label. Our approach exploits a linear programming relaxation for the
task of finding the best joint assignment in such networks, which provides an approximate
quadratic program (QP) for the problem of learning a margin-maximizing Markov network. We
show that for associative Markov networks over binary-valued variables, this approximate QP
is guaranteed to return an optimal parameterization for Markov networks of arbitrary
topology. For the non-binary case, optimality is not guaranteed, but the relaxation
produces good solutions in practice. Experimental results with hypertext and newswire
classification show significant advantages over standard approaches.


1.  Introduction 

Numerous classification methods have been developed for the principal machine learning
problem of assigning to a single object one of K labels consistent with its properties.
Many classification problems, however, involve sets of related objects whose labels must
also be consistent with each other. In hypertext or bibliographic classification, labels of
linked and co-cited documents tend to be similar (Chakrabarti et al., 1998; Taskar et al.,
2002). In proteomic analysis, location and function of proteins that interact are often
highly correlated (Vazquez et al., 2003). In image processing, neighboring pixels exhibit
local label coherence in denoising, segmentation and stereo correspondence (Besag, 1986;
Boykov et al., 1999a).

Markov  networks  compactly  represent  complex 
joint  distributions  of  the  label  variables  by  modeling 
their  local  interactions.  Such  models  are  encoded  by  a 
graph, whose nodes represent the different object labels,
and whose edges represent direct dependencies
between  them.  For  example,  a  Markov  network  for 
the  hypertext  domain  would  include  a  node  for  each 
webpage,  encoding  its  label,  and  an  edge  between  any 
pair  of  webpages  whose  labels  are  directly  correlated 
(e.g.,  because  one  links  to  the  other). 

There has been growing interest in training Markov networks for the purpose of collectively
classifying sets of related instances. The focus has been on discriminative training, which,
given enough data, generally provides significant improvements in classification accuracy
over generative training. For example, Markov networks can be trained to maximize the
conditional likelihood of the labels given the features of the objects (Lafferty et al.,
2001; Taskar et al., 2002). Recently, maximum margin-based training has been shown to
additionally boost accuracy over conditional likelihood methods and allow a seamless
integration of kernel methods with Markov networks (Taskar et al., 2003a).

The chief computational bottleneck in this task is inference in the underlying network,
which is a core subroutine for all methods for training Markov networks. Probabilistic
inference is NP-hard in general, and requires exponential time in a broad range of
practical Markov network structures, including grid-topology networks (Besag, 1986). One can
address the tractability issue by limiting the structure of the underlying network. In some
cases, such as the quadtree model used for image segmentation (Bouman & Shapiro, 1994), a
tractable structure is determined in advance. In other cases (e.g., Bach & Jordan, 2001),
the network structure is learned, subject to the constraint that inference on these
networks is tractable. In many cases, however, the topology of the Markov network does not
allow tractable inference. In the hypertext domain, the network structure mirrors the
hyperlink graph, which is usually highly interconnected, leading to computationally
intractable networks.

In this paper, we show that optimal learning is feasible for an important subclass of
Markov networks: networks with attractive potentials. This subclass, which we call
associative Markov networks (AMNs), contains networks of discrete variables with K labels
each and arbitrary-size clique potentials with K parameters that favor the same label for
all variables in the clique. Such positive interactions capture the “guilt by association”
pattern of reasoning present in many domains, in which connected (“associated”) variables
tend to have the same label. AMNs are a natural fit for object recognition and
segmentation, webpage classification, and many other applications.

Our analysis is based on the maximum margin approach to training Markov networks, presented
by Taskar et al. (2003a). In this formulation, the learning task is to find the Markov
network parameterization that achieves the highest confidence in the target labels. In
other words, the goal is to maximize the margin between the target labels and any other
label assignment. The inference subtask in this formulation of the learning problem is one
of finding the best joint (MAP) assignment to all of the variables in a Markov network. By
contrast, other learning tasks (e.g., maximizing the conditional likelihood of the target
labels given the features) often require that we compute the posterior probabilities of
different label assignments, rather than just the MAP.

The MAP problem can naturally be expressed as an integer programming problem. We show how
we can approximate the maximum margin Markov network learning task as a quadratic program
that uses a linear program (LP) relaxation of this integer program. This quadratic program
can be solved in polynomial time using standard techniques. We show that whenever the MAP
LP relaxation is guaranteed to return integer solutions, the approximate max-margin QP
provides an optimal solution to the max-margin optimization task. In particular, for
associative Markov networks over binary variables (K = 2), this linear program provides
exact answers. For the non-binary case (K > 2), the approximate quadratic program is not
guaranteed to be optimal, but our empirical results suggest that the solutions work well in
practice. To our knowledge, our method is the first to allow training Markov networks of
arbitrary topology.


2.  Markov  Networks 

We restrict attention to networks over discrete variables $Y = \{Y_1, \ldots, Y_N\}$, where
each variable corresponds to an object we wish to classify and has K possible labels:
$Y_i \in \{1, \ldots, K\}$. An assignment of values to Y is denoted by y. A Markov network
for Y defines a joint distribution over $\{1, \ldots, K\}^N$.

A Markov network is defined by an undirected graph over the nodes
$Y = \{Y_1, \ldots, Y_N\}$. In general, a Markov network is a set of cliques C, where each
clique $c \in C$ is associated with a subset $Y_c$ of Y. The nodes $Y_c$ in a clique c form
a fully connected subgraph (a clique) in the Markov network graph. Each clique is
accompanied by a potential $\phi_c(Y_c)$, which associates a non-negative value with each
assignment $y_c$ to $Y_c$. The Markov network defines the probability distribution:

$$P_\phi(y) = \frac{1}{Z} \prod_{c \in C} \phi_c(y_c),$$

where Z is the partition function given by
$Z = \sum_{y'} \prod_{c \in C} \phi_c(y'_c)$.

For simplicity of exposition, we focus most of our discussion on pairwise Markov networks.
We extend our results to higher-order interactions in Sec. 3. A pairwise Markov network is
simply a Markov network where all of the cliques involve either a single node or a pair of
nodes. Thus, in a pairwise Markov network with edges $E = \{(ij)\}$ ($i < j$), only nodes
and edges are associated with potentials $\phi_i(Y_i)$ and $\phi_{ij}(Y_i, Y_j)$. A
pairwise Markov net defines the distribution

$$P_\phi(y) = \frac{1}{Z} \prod_{i=1}^{N} \phi_i(y_i) \prod_{(ij) \in E} \phi_{ij}(y_i, y_j),$$

where Z is the partition function given by
$Z = \sum_{y'} \prod_{i=1}^{N} \phi_i(y'_i) \prod_{(ij) \in E} \phi_{ij}(y'_i, y'_j)$.
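To make the pairwise definition concrete, the sketch below enumerates every assignment of a tiny network and normalizes by the partition function. It is a brute-force illustration only (exponential in N), with labels indexed from 0 rather than 1; the function name and the toy potentials are assumptions of the example.

```python
from itertools import product


def pairwise_distribution(node_pot, edge_pot):
    """Joint distribution of a small pairwise Markov network by enumeration.

    node_pot: list of N lists, node_pot[i][k] = phi_i(k) >= 0.
    edge_pot: dict mapping (i, j) with i < j to a K x K nested list,
              edge_pot[(i, j)][k][l] = phi_ij(k, l) >= 0.
    Returns a dict mapping each assignment tuple y to its probability.
    """
    n = len(node_pot)
    k = len(node_pot[0])
    unnorm = {}
    for y in product(range(k), repeat=n):
        score = 1.0
        for i in range(n):
            score *= node_pot[i][y[i]]        # node potentials
        for (i, j), table in edge_pot.items():
            score *= table[y[i]][y[j]]        # edge potentials
        unnorm[y] = score
    z = sum(unnorm.values())                  # partition function Z
    return {y: s / z for y, s in unnorm.items()}


# Toy chain 0 - 1 - 2 with K = 2 and edge potentials that favor equal labels.
assoc = [[2.0, 1.0], [1.0, 2.0]]
node_pot = [[3.0, 1.0], [1.0, 1.0], [1.0, 1.0]]
edge_pot = {(0, 1): assoc, (1, 2): assoc}
dist = pairwise_distribution(node_pot, edge_pot)
print(max(dist, key=dist.get))
```

In this toy network only node 0 has a preference, yet the edge potentials propagate it: the all-zeros assignment is the most probable one. Enumeration is feasible only for very small N, which is why the tractability of inference matters so much in the rest of the discussion.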

The node and edge potentials are functions of the features of the objects
$x_i \in \mathbb{R}^{d_n}$ and features of the relationships between them
$x_{ij} \in \mathbb{R}^{d_e}$. In hypertext classification, $x_i$ might be the counts of
the words of the document i, while $x_{ij}$ might be the words surrounding the
hyperlink(s) between documents i and j. The simplest model of dependence of the potentials
on the features is a log-linear combination: $\log \phi_i(k) = w_n^k \cdot x_i$ and
$\log \phi_{ij}(k, l) = w_e^{k,l} \cdot x_{ij}$, where $w_n^k$ and $w_e^{k,l}$ are
label-specific row vectors of node and edge parameters, of size $d_n$ and $d_e$,
respectively. Note that this formulation assumes that all of the nodes in the network share
the same set of weights, and similarly all of the edges share the same weights.

We represent an assignment y as a set of $K \cdot N$ indicators $\{y_i^k\}$, where
$y_i^k = I(y_i = k)$. With these definitions, the log of conditional probability
$\log P_w(y \mid x)$ is given by:

$$\sum_{i=1}^{N} \sum_{k=1}^{K} (w_n^k \cdot x_i)\, y_i^k
  + \sum_{(ij) \in E} \sum_{k,l=1}^{K} (w_e^{k,l} \cdot x_{ij})\, y_i^k y_j^l
  - \log Z_w(x).$$

Note that the partition function $Z_w(x)$ above depends on the parameters w and input
features x, but not on the labels $y_i$.

For compactness of notation, we define the node and edge weight vectors
$w_n = (w_n^1, \ldots, w_n^K)$ and $w_e = (w_e^{1,1}, \ldots, w_e^{K,K})$, and let
$w = (w_n, w_e)$ be a vector of all the weights, of size $d = K d_n + K^2 d_e$. Also, we
define the node and edge labels vectors,
$y_n = (\ldots, y_i^1, \ldots, y_i^K, \ldots)^\top$ and
$y_e = (\ldots, y_{ij}^{1,1}, \ldots, y_{ij}^{K,K}, \ldots)^\top$, where
$y_{ij}^{k,l} = y_i^k y_j^l$, and the vector of all labels $y = (y_n, y_e)$ of size
$L = KN + K^2|E|$. Finally, we define an appropriate $d \times L$ matrix X such that

$$\log P_w(y \mid x) = wXy - \log Z_w(x).$$

The matrix X contains the node feature vectors $x_i$ and edge feature vectors $x_{ij}$
repeated multiple times (for each label k or label pair k, l respectively), and padded
with zeros appropriately.

A key task in Markov networks is computing the MAP (maximum a posteriori) assignment, that is, the assignment $\mathbf{y}$ that maximizes $\log P_\mathbf{w}(\mathbf{y} \mid \mathbf{x})$. It is straightforward to formulate the MAP inference task as an integer linear program: the variables are the assignments to the nodes $y_i^k$ and edges $y_{ij}^{k,l}$, which must be in the set $\{0, 1\}$ and satisfy linear normalization and agreement constraints. The optimization criterion is simply the linear function $\mathbf{w}\mathbf{X}\mathbf{y}$, which corresponds to the log of the unnormalized probability of the assignment $\mathbf{y}$.

In certain cases, we can take this integer program and approximate it as a linear program by relaxing the integrality constraints on $y_i^k$, with appropriate constraints. For example, Wainwright et al. (2002) provide a natural formulation of this form that is guaranteed to produce integral solutions for triangulated graphs.
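To make the MAP task concrete, here is a brute-force sketch that maximizes the unnormalized log-probability over all integer labelings of a tiny network. It is not the integer or linear program described above, only an exhaustive reference against which such programs could be checked; the score tables are hypothetical.

```python
import itertools

def score(y, node_scores, edge_scores):
    """Linear objective wXy for an integer labeling y."""
    s = sum(node_scores[i][k] for i, k in enumerate(y))
    s += sum(table[y[i]][y[j]] for (i, j), table in edge_scores.items())
    return s

def map_assignment(node_scores, edge_scores, K):
    """Exhaustive MAP: argmax of the score over all K^N labelings."""
    N = len(node_scores)
    return max(itertools.product(range(K), repeat=N),
               key=lambda y: score(y, node_scores, edge_scores))

# Hypothetical 3-node chain, K = 2. Node 1 alone would prefer label 1,
# but the edge scores reward agreement with its neighbours.
node_scores = [[2.0, 0.0], [0.0, 0.5], [1.0, 0.0]]
edge_scores = {(0, 1): [[1.0, 0.0], [0.0, 1.0]],
               (1, 2): [[1.0, 0.0], [0.0, 1.0]]}

independent = tuple(max(range(2), key=lambda k: s[k]) for s in node_scores)
joint = map_assignment(node_scores, edge_scores, 2)
print(independent, joint)
```

In this toy case the per-node maximizer labels the middle node differently from the joint maximizer, which is the kind of correction that collective inference is meant to provide.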

3.  Associative  Markov  Networks 

We now describe one important subclass of problems for which the above relaxation is particularly useful. These networks, which we call associative Markov networks (AMNs), encode situations where related variables tend to have the same value.

Associative interactions arise naturally in the context of image processing, where nearby pixels are likely to have the same label (Besag, 1986; Boykov et al., 1999b). In this setting, a common approach is to use a generalized Potts model (Potts, 1952), which penalizes assignments that do not have the same label across the edge: $\phi_{ij}(k, l) = \lambda_{ij}, \ \forall k \neq l$ and $\phi_{ij}(k, k) = 1$, where $\lambda_{ij} \le 1$.

For binary-valued Potts models, Greig et al. (1989) show that the MAP problem can be formulated as a min-cut in an appropriately constructed graph. Thus, the MAP problem can be solved exactly for this class of models in polynomial time. For $K > 2$, the MAP problem is NP-hard, but a procedure based on a relaxed linear program guarantees a factor 2 approximation of the optimal solution (Boykov et al., 1999b; Kleinberg & Tardos, 1999). Kleinberg and Tardos (1999) extend the multi-class Potts model to have more general edge potentials, under the constraint that the negative log potentials $-\log \phi_{ij}(k, l)$ form a metric on the set of labels. They also provide a solution based on a relaxed LP that has certain approximation guarantees.

More recently, Kolmogorov and Zabih (2002) showed how to optimize energy functions containing binary and ternary interactions using graph cuts, as long as the parameters satisfy a certain regularity condition. Our definition of associative potentials below also satisfies the Kolmogorov and Zabih regularity condition for $K = 2$. However, the structure of our potentials is simpler to describe and extend for the multi-class case. We use a linear programming formulation (instead of min-cut) for the MAP inference, which allows us to use the maximum margin estimation framework, as described below. Note, however, that we can also use min-cut to perform exact inference on the learned models for $K = 2$, and approximate inference for $K > 2$ as in Boykov et al. (1999a).

Our associative potentials extend the Potts model in several ways. Importantly, AMNs allow different labels to have different attraction strength: $\phi_{ij}(k, k) = \lambda_{ij}^k$, where $\lambda_{ij}^k \ge 1$, and $\phi_{ij}(k, l) = 1, \ \forall k \neq l$. This additional flexibility is important in many domains, as different labels can have very diverse affinities. For example, foreground pixels tend to have locally coherent values while background is much more varied.
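The relation between the two potential families can be sketched in log space as follows. The helper names are hypothetical. The sketch relies on the fact that multiplying a potential table by a positive constant does not change the distribution, so a Potts table with penalty $\lambda \le 1$ can be rescaled into the associative form with the same attraction $1/\lambda \ge 1$ for every label.

```python
def potts_log_potential(K, log_lam):
    """Generalized Potts: log phi(k, l) = log_lam (<= 0) for k != l, and 0 for k == l."""
    return [[0.0 if k == l else log_lam for l in range(K)] for k in range(K)]

def amn_log_potential(log_lams):
    """Associative: log phi(k, k) = log_lams[k] (>= 0), and 0 for k != l."""
    K = len(log_lams)
    return [[log_lams[k] if k == l else 0.0 for l in range(K)] for k in range(K)]

def is_associative(table):
    """Check the associative form: non-negative diagonal, zero off-diagonal."""
    K = len(table)
    return all(table[k][l] >= 0.0 if k == l else table[k][l] == 0.0
               for k in range(K) for l in range(K))

potts = potts_log_potential(3, -1.0)
# Subtracting log_lam from every entry rescales phi by the constant 1/lambda.
rescaled = [[v - (-1.0) for v in row] for row in potts]
per_label = amn_log_potential([0.3, 1.2, 0.0])  # label-specific attraction

print(is_associative(potts), is_associative(rescaled), is_associative(per_label))
```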

The linear programming relaxation of the MAP problem for these networks can be written as:

$$\max \ \sum_{i=1}^{N} \sum_{k=1}^{K} (\mathbf{w}_n^k \cdot \mathbf{x}_i)\, y_i^k + \sum_{(ij) \in E} \sum_{k=1}^{K} (\mathbf{w}_e^k \cdot \mathbf{x}_{ij})\, y_{ij}^k \tag{1}$$

$$\text{s.t.} \quad y_i^k \ge 0, \ \forall i, k; \qquad \sum_{k} y_i^k = 1, \ \forall i;$$

$$y_{ij}^k \le y_i^k, \quad y_{ij}^k \le y_j^k, \quad \forall (ij) \in E, \ k.$$

Note that we substitute the constraint $y_{ij}^k = y_i^k \wedge y_j^k$ by two linear constraints $y_{ij}^k \le y_i^k$ and $y_{ij}^k \le y_j^k$. This works because the coefficient $\mathbf{w}_e^k \cdot \mathbf{x}_{ij}$ is non-negative and we are maximizing the objective function. Hence, at the optimum $y_{ij}^k = \min(y_i^k, y_j^k)$, which is equivalent to $y_{ij}^k = y_i^k \wedge y_j^k$.
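The substitution of the conjunction by two inequalities can be illustrated with a short sketch; the helper names are hypothetical. For non-negative edge coefficients, the largest feasible value of each edge variable is the elementwise minimum of its endpoints, and for 0/1 indicators that minimum coincides with the logical AND.

```python
def optimal_edge_variables(y_i, y_j):
    """Largest y_ij^k satisfying y_ij^k <= y_i^k and y_ij^k <= y_j^k, for each label k."""
    return [min(a, b) for a, b in zip(y_i, y_j)]

def satisfies_edge_constraints(y_ij, y_i, y_j):
    """Check the two linear constraints of Eq. (1) for one edge."""
    return all(e <= a and e <= b for e, a, b in zip(y_ij, y_i, y_j))

# Integral endpoints: min reproduces the logical AND of the indicators.
print(optimal_edge_variables([1, 0], [1, 0]))   # both nodes take label 0
print(optimal_edge_variables([1, 0], [0, 1]))   # labels disagree

# Fractional endpoints: min is still the tightest feasible choice.
y_ij = optimal_edge_variables([0.5, 0.5], [0.2, 0.8])
print(y_ij, satisfies_edge_constraints(y_ij, [0.5, 0.5], [0.2, 0.8]))
```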

In a second important extension, AMNs admit non-pairwise interactions between variables, with potentials over cliques involving $m$ variables $\phi(y_{i_1}, \ldots, y_{i_m})$. In this case, the clique potentials are constrained to have the same type of structure as the edge potentials: there are $K$ parameters $\phi(k, \ldots, k) = \lambda_c^k$ and the rest of the entries are set to 1. In particular, using this additional expressive power, AMNs allow us to encode the pattern of (soft) transitivity present in many domains. For example, consider the problem of predicting whether two proteins interact (Vazquez et al., 2003); this probability may increase if they both interact with another protein. This type of transitivity could be modeled by a ternary clique that has high $\lambda$ for the assignment with all interactions present.

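We can write a linear program for the MAP problem similar to Eq. (1), where we have a variable $y_c^k$ for each clique $c$ and for each label $k$, which represents the event that all nodes in the clique $c$ have label $k$:

$$\max \ \sum_{i=1}^{N} \sum_{k=1}^{K} (\mathbf{w}_n^k \cdot \mathbf{x}_i)\, y_i^k + \sum_{c \in C} \sum_{k=1}^{K} (\mathbf{w}_c^k \cdot \mathbf{x}_c)\, y_c^k \tag{2}$$

$$\text{s.t.} \quad y_i^k \ge 0, \ \forall i, k; \qquad \sum_{k} y_i^k = 1, \ \forall i;$$

$$y_c^k \le y_i^k, \quad \forall c \in C, \ i \in c, \ k.$$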

It  can  be  shown  that  in  the  binary  case,  the  relaxed 
linear  programs  Eq.  (1)  and  Eq.  (2)  are  guaranteed  to 
produce  an  integer  solution  when  a  unique  solution 
exists. 

Theorem  3.1  If  K  =  2,  for  any  objective  wX,  the 
linear  programs  in  Eq.  (1)  and  Eq.  (2)  have  an  integral 
optimal  solution. 

See appendix for the proof. This result states that the MAP problem in binary AMNs is tractable, regardless of network topology or clique size. In the non-binary case ($K > 2$), these LPs can produce fractional solutions and we use a rounding procedure to get an integral solution. In the appendix, we also show that the approximation ratio of the rounding procedure is the inverse of the size of the largest clique (e.g., $\frac{1}{2}$ for pairwise networks). Although artificial examples with fractional solutions can be easily constructed by using symmetry, it seems that in real data such symmetries are often broken. In fact, in all our experiments with $K > 2$ on real data, we never encountered fractional solutions.
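One symmetric construction of the kind alluded to above can be sketched as follows. The network, its scores, and the helper names are hypothetical and are not taken from the experiments. Three nodes form a triangle with $K = 3$; node $i$ is strongly discouraged from taking label $i$, and every edge rewards agreement on any label. The sketch only exhibits a feasible fractional point whose objective exceeds that of every integral labeling; it does not solve the LP.

```python
import itertools

K = 3
FORBIDDEN = -10.0
# Node i is penalised for label i, so each pair of nodes shares exactly one cheap label.
node_scores = [[FORBIDDEN if k == i else 0.0 for k in range(K)] for i in range(3)]
edges = [(0, 1), (0, 2), (1, 2)]
edge_score = 1.0  # same non-negative reward for agreeing on any label

def lp_objective(y):
    """Objective of Eq. (1) with edge variables set to min(y_i^k, y_j^k)."""
    value = sum(node_scores[i][k] * y[i][k] for i in range(3) for k in range(K))
    value += sum(edge_score * min(y[i][k], y[j][k])
                 for (i, j) in edges for k in range(K))
    return value

def one_hot(label):
    return [1.0 if k == label else 0.0 for k in range(K)]

best_integral = max(lp_objective([one_hot(l) for l in labels])
                    for labels in itertools.product(range(K), repeat=3))

# Each node splits its mass evenly over its two cheap labels.
fractional = [[0.0 if k == i else 0.5 for k in range(K)] for i in range(3)]

print(best_integral, lp_objective(fractional))
```

Breaking the symmetry, for instance by giving one node a clear preference among its cheap labels, is the kind of perturbation that tends to restore integrality.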


4.  Max  Margin  Estimation 

We now consider the problem of training the weights $\mathbf{w}$ of a Markov network given a labeled training instance $(\mathbf{x}, \hat{\mathbf{y}})$. For simplicity of exposition, we assume that we have only a single training instance; the extension to the case of multiple instances is entirely straightforward. Note that, in our setting, a single training instance actually contains multiple objects. For example, in the hypertext domain, an instance might be an entire website, containing many inter-linked webpages.

The M3N Framework. The standard approach of learning the weights $\mathbf{w}$ given $(\mathbf{x}, \hat{\mathbf{y}})$ is to maximize $\log P_\mathbf{w}(\hat{\mathbf{y}} \mid \mathbf{x})$, with an additional regularization term, which is usually taken to be the squared norm of the weights $\mathbf{w}$ (Lafferty et al., 2001). An alternative method, recently proposed by Taskar et al. (2003a), is to maximize the margin of confidence in the true label assignment $\hat{\mathbf{y}}$ over any other assignment $\mathbf{y} \neq \hat{\mathbf{y}}$. They show that the margin-maximization criterion provides significant improvements in accuracy over a range of problems. It also allows high-dimensional feature spaces to be utilized by using the kernel trick, as in support vector machines. The maximum margin Markov network (M3N) framework forms the basis for our work, so we begin by reviewing this approach.

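As in support vector machines, the goal in an M3N is to maximize our confidence in the true labels $\hat{\mathbf{y}}$ relative to any other possible joint labelling $\mathbf{y}$. Specifically, we define the gain of the true labels $\hat{\mathbf{y}}$ over another possible joint labelling $\mathbf{y}$ as:

$$\log P_\mathbf{w}(\hat{\mathbf{y}} \mid \mathbf{x}) - \log P_\mathbf{w}(\mathbf{y} \mid \mathbf{x}) = \mathbf{w}\mathbf{X}(\hat{\mathbf{y}} - \mathbf{y}).$$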

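In M3Ns, the desired gain takes into account the number of labels in $\mathbf{y}$ that are misclassified, $\Delta(\hat{\mathbf{y}}, \mathbf{y})$, by scaling linearly with it:

$$\max \ \gamma \quad \text{s.t.} \quad \mathbf{w}\mathbf{X}(\hat{\mathbf{y}} - \mathbf{y}) \ge \gamma\, \Delta(\hat{\mathbf{y}}, \mathbf{y}); \quad \|\mathbf{w}\|^2 \le 1.$$

Note that the number of incorrect node labels $\Delta(\hat{\mathbf{y}}, \mathbf{y})$ can also be written as $N - \hat{\mathbf{y}}_n^\top \mathbf{y}_n$. (Whenever $\hat{y}_i$ and $y_i$ agree on some label $k$, we have that $\hat{y}_i^k = 1$ and $y_i^k = 1$, adding 1 to $\hat{\mathbf{y}}_n^\top \mathbf{y}_n$.) By dividing through by $\gamma$ and adding a slack variable for non-separable data, we obtain a quadratic program (QP) with exponentially many constraints:

$$\min \ \frac{1}{2}\|\mathbf{w}\|^2 + C\xi \tag{3}$$

$$\text{s.t.} \quad \mathbf{w}\mathbf{X}(\hat{\mathbf{y}} - \mathbf{y}) \ge N - \hat{\mathbf{y}}_n^\top \mathbf{y}_n - \xi, \quad \forall \mathbf{y} \in \mathcal{Y}.$$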
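The identity between the number of misclassified nodes and $N$ minus the inner product of the node indicator vectors can be checked with a short sketch; the helper names and the example labelings are hypothetical.

```python
def indicators(labels, K):
    """Flatten a labeling into the K*N indicator vector y_n, with y_i^k = I(y_i = k)."""
    return [1 if label == k else 0 for label in labels for k in range(K)]

def hamming_loss(y_true, y_other, K):
    """Number of misclassified nodes, computed as N minus the indicator inner product."""
    a = indicators(y_true, K)
    b = indicators(y_other, K)
    return len(y_true) - sum(u * v for u, v in zip(a, b))

y_true = [0, 1, 2, 1]
y_other = [0, 2, 2, 0]
print(hamming_loss(y_true, y_other, 3))                  # via indicators
print(sum(1 for a, b in zip(y_true, y_other) if a != b)) # direct count
```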

This QP has a constraint for every possible joint assignment $\mathbf{y}$ to the Markov network variables, resulting in an exponentially-sized QP. Taskar et al. show how structure in the dual of this QP can be exploited to allow an efficient solution when the underlying network has low tree width.

Figure 1. Exact and approximate constraints on the max-margin quadratic program. The solid red line represents the constraints imposed by integer $\mathbf{y}$'s, whereas the dashed blue line represents the stronger constraints imposed by the larger set of fractional $\mathbf{y}$'s. The fractional constraints may coincide with the integer constraints in some cases, and be more stringent in others. The parabolic contours represent the value of the objective function.

M3N  relaxations. 

As an alternative to the approach of Taskar et al., we now derive a more generally applicable approach for exploiting structure and relaxations in max-margin problems. As our first step, we replace the exponential set of linear constraints in the max-margin QP of Eq. (3) with the single equivalent non-linear constraint:

$$\mathbf{w}\mathbf{X}\hat{\mathbf{y}} - N + \xi \ \ge \ \max_{\mathbf{y} \in \mathcal{Y}} \ \mathbf{w}\mathbf{X}\mathbf{y} - \hat{\mathbf{y}}_n^\top \mathbf{y}_n.$$

This non-linear constraint essentially requires that we find the assignment $\mathbf{y}$ to the network variables which has the highest probability relative to the parameterization $\mathbf{w}\mathbf{X} - \hat{\mathbf{y}}_n^\top$. Thus, optimizing the max-margin QP contains the MAP inference task as a component.
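The single non-linear constraint can be made concrete with a brute-force sketch that computes the smallest slack satisfying it for fixed weights. The enumeration stands in for the MAP solver and is feasible only for tiny networks; the score tables and function names are hypothetical.

```python
import itertools

def score(y, node_scores, edge_scores):
    """Linear score wXy of an integer labeling y."""
    s = sum(node_scores[i][k] for i, k in enumerate(y))
    s += sum(table[y[i]][y[j]] for (i, j), table in edge_scores.items())
    return s

def min_slack(y_true, node_scores, edge_scores, K):
    """Smallest xi with  wX y_true - N + xi >= max_y [ wX y - (number of agreements) ]."""
    N = len(y_true)
    rhs = max(score(y, node_scores, edge_scores)
              - sum(1 for a, b in zip(y_true, y) if a == b)
              for y in itertools.product(range(K), repeat=N))
    return rhs - score(y_true, node_scores, edge_scores) + N

edge_scores = {(0, 1): [[0.5, 0.0], [0.0, 0.5]]}
strong = [[2.0, 0.0], [0.0, 2.0]]   # node scores that clearly favour the true labels
weak = [[0.5, 0.0], [0.0, 0.5]]     # same preference, but with a small margin

print(min_slack([0, 1], strong, edge_scores, 2))
print(min_slack([0, 1], weak, edge_scores, 2))
```

The slack is never negative, because the true labeling itself is one of the candidates in the maximization.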

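As we discussed earlier, we can formulate the MAP problem as an integer program, and then relax it into a linear program. Inserting the relaxed LP into the QP of Eq. (3), we obtain:

$$\min \ \frac{1}{2}\|\mathbf{w}\|^2 + C\xi \tag{4}$$

$$\text{s.t.} \quad \mathbf{w}\mathbf{X}\hat{\mathbf{y}} - N + \xi \ \ge \ \max_{\mathbf{y} \in \mathcal{Y}'} \ \mathbf{w}\mathbf{X}\mathbf{y} - \hat{\mathbf{y}}_n^\top \mathbf{y}_n,$$

where $\mathcal{Y}'$ is the space of all legal fractional values for $\mathbf{y}$. In effect, we obtain a QP with a continuum of constraints, one for every fractional assignment to $\mathbf{y}$.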

It follows that, in cases where the relaxed LP is guaranteed to provide integer solutions, the integer and relaxed constraint sets coincide, so that the approximate QP is computing precisely the optimal max-margin solution. In the general case, the linear relaxation strengthens the constraints on $\mathbf{w}$ by potentially adding constraints corresponding to fractional assignments $\mathbf{y}$. Fig. 1 shows how the relaxation of the max subproblem reduces the feasible space of $\mathbf{w}$ and $\xi$. Note that for every setting of the weights $\mathbf{w}$ that produces fractional solutions for the LP relaxation, the approximate constraints are tightened because of the additional fractional assignments $\mathbf{y}$. In this case, the fractional MAP solution is better than any integer solution, including $\hat{\mathbf{y}}$, thereby driving up the corresponding slack $\xi$. By contrast, for weights $\mathbf{w}$ for which the MAP LP is integer-valued, the margin has the standard interpretation as the difference between the probability of $\hat{\mathbf{y}}$ and the MAP $\mathbf{y}$ (according to $\mathbf{w}$). As the objective includes a penalty for the slack variable, intuitively, minimizing the objective tends to drive the weights $\mathbf{w}$ away from the regions where the solutions to the MAP LP are fractional.

While this insight allows us to replace the MAP integer program within the QP with a linear program, the resulting QP does not appear tractable. However, here we can exploit fundamental properties of linear programming duality (Bertsimas & Tsitsiklis, 1997). Assume that our relaxed LP for the inference task has the form:

$$\max_{\mathbf{y}} \ \mathbf{w}\mathbf{B}\mathbf{y} \quad \text{s.t.} \quad \mathbf{y} \ge 0, \ \mathbf{A}\mathbf{y} \le \mathbf{b}, \tag{5}$$

for some polynomial-size $\mathbf{A}, \mathbf{B}, \mathbf{b}$. (For example, Eq. (1) and Eq. (2) can be easily written in this compact form.) The dual of this LP is given by:

$$\min_{\mathbf{z}} \ \mathbf{b}^\top \mathbf{z} \quad \text{s.t.} \quad \mathbf{z} \ge 0, \ \mathbf{A}^\top \mathbf{z} \ge (\mathbf{w}\mathbf{B})^\top. \tag{6}$$

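When the relaxed LP is feasible and bounded, the value of Eq. (6) provides an upper bound on the primal that achieves the same value as the primal at its minimum. If we substitute Eq. (6) for Eq. (5) in the QP of Eq. (4), we obtain a quadratic program over $\mathbf{w}$, $\xi$ and $\mathbf{z}$ with polynomially many linear constraints:

$$\min \ \frac{1}{2}\|\mathbf{w}\|^2 + C\xi \tag{7}$$

$$\text{s.t.} \quad \mathbf{w}\mathbf{X}\hat{\mathbf{y}} - N + \xi \ \ge \ \mathbf{b}^\top \mathbf{z};$$

$$\mathbf{z} \ge 0, \quad \mathbf{A}^\top \mathbf{z} \ge (\mathbf{w}\mathbf{B})^\top.$$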

Our ability to perform this transformation is a direct consequence of the connection between the max-margin criterion and the MAP inference problem. The transformation is useful whenever we can solve or approximate MAP using a compact linear program.
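The duality property that justifies replacing the inner maximization by dual variables can be illustrated on a tiny LP of the form of Eq. (5) and Eq. (6). The numbers are hypothetical, and `c` plays the role of the row vector $\mathbf{w}\mathbf{B}$. The sketch checks feasibility of hand-picked points rather than solving either program.

```python
def is_primal_feasible(A, b, y):
    """y >= 0 and A y <= b."""
    return all(v >= 0 for v in y) and all(
        sum(A[r][j] * y[j] for j in range(len(y))) <= b[r] for r in range(len(b)))

def is_dual_feasible(A, c, z):
    """z >= 0 and A^T z >= c."""
    return all(v >= 0 for v in z) and all(
        sum(A[r][j] * z[r] for r in range(len(z))) >= c[j] for j in range(len(c)))

def value(coeffs, point):
    return sum(a * p for a, p in zip(coeffs, point))

# max 3*y1 + 2*y2  s.t.  y1 + y2 <= 4,  y1 <= 2,  y >= 0
A = [[1.0, 1.0], [1.0, 0.0]]
b = [4.0, 2.0]
c = [3.0, 2.0]

y_opt = [2.0, 2.0]      # primal feasible point
z_opt = [2.0, 1.0]      # dual feasible point with the same objective value
z_loose = [3.0, 0.0]    # another dual feasible point, giving a looser bound

print(value(c, y_opt), value(b, z_opt), value(b, z_loose))
```

Any dual feasible $\mathbf{z}$ upper-bounds the primal value, so requiring the left-hand side of the margin constraint to exceed $\mathbf{b}^\top \mathbf{z}$ for some feasible $\mathbf{z}$ is enough to guarantee that it exceeds the inner maximum.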

5.  Max  Margin  AMNs 

The transformation described in the previous section applies to any situation where the MAP problem can be effectively approximated as a linear program. In particular, the LP relaxation of Eq. (1) provides us with precisely the necessary building block to provide an effective solution for the QP in Eq. (4) for the case of AMNs. As we discussed, the MAP problem is precisely the max subproblem in this QP. In the case of AMNs, this max subproblem can be replaced with the relaxed LP of Eq. (1). In effect, we are replacing the exponential constraint set (one which includes a constraint for every discrete $\mathbf{y}$) with an infinite constraint set (one which includes a constraint for every continuous vector $\mathbf{y}$) in

$$\mathcal{Y}' = \Big\{ \mathbf{y} : \ y_i^k \ge 0; \ \sum_{k} y_i^k = 1; \ y_{ij}^k \le y_i^k, \ y_{ij}^k \le y_j^k \Big\}$$

as defined in Eq. (1).

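Stating the AMN restrictions in terms of the parameters $\mathbf{w}$, we require that $\mathbf{w}_e^{k,l} = 0, \ \forall k \neq l$ and $\mathbf{w}_e^{k,k} \cdot \mathbf{x}_{ij} \ge 0$. To ensure that $\mathbf{w}_e^{k,k} \cdot \mathbf{x}_{ij} \ge 0$, we simply assume (without loss of generality) that $\mathbf{x}_{ij} \ge 0$, and constrain $\mathbf{w}_e^{k,k} \ge 0$. Incorporating this constraint, we obtain our basic AMN QP:

$$\min \ \frac{1}{2}\|\mathbf{w}\|^2 + C\xi \tag{8}$$

$$\text{s.t.} \quad \mathbf{w}\mathbf{X}\hat{\mathbf{y}} - N + \xi \ \ge \ \max_{\mathbf{y} \in \mathcal{Y}'} \ \mathbf{w}\mathbf{X}\mathbf{y} - \hat{\mathbf{y}}_n \cdot \mathbf{y}_n;$$

$$\mathbf{w}_e \ge 0.$$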

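We can now transform this QP as specified in Eq. (7), by taking the dual of the LP used to represent the interior max. Specifically, $\max_{\mathbf{y} \in \mathcal{Y}'} \mathbf{w}\mathbf{X}\mathbf{y} - \hat{\mathbf{y}}_n \cdot \mathbf{y}_n$ is a feasible and bounded linear program in $\mathbf{y}$, with a dual given by:

$$\min \ \sum_{i=1}^{N} z_i \tag{9}$$

$$\text{s.t.} \quad z_i - \sum_{(ij),(ji) \in E} z_{ij}^k \ \ge \ \mathbf{w}_n^k \cdot \mathbf{x}_i - \hat{y}_i^k, \quad \forall i, k;$$

$$z_{ij}^k + z_{ji}^k \ \ge \ \mathbf{w}_e^k \cdot \mathbf{x}_{ij}, \quad z_{ij}^k, z_{ji}^k \ge 0, \quad \forall (ij) \in E, \ k.$$

In the dual, we have a variable $z_i$ for each normalization constraint in Eq. (1) and variables $z_{ij}^k, z_{ji}^k$ for each of the inequality constraints.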

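Substituting this dual into Eq. (8), we obtain:

$$\min \ \frac{1}{2}\|\mathbf{w}\|^2 + C\xi \tag{10}$$

$$\text{s.t.} \quad \mathbf{w}\mathbf{X}\hat{\mathbf{y}} - N + \xi \ \ge \ \sum_{i=1}^{N} z_i; \qquad \mathbf{w}_e \ge 0;$$

$$z_i - \sum_{(ij),(ji) \in E} z_{ij}^k \ \ge \ \mathbf{w}_n^k \cdot \mathbf{x}_i - \hat{y}_i^k, \quad \forall i, k;$$

$$z_{ij}^k + z_{ji}^k \ \ge \ \mathbf{w}_e^k \cdot \mathbf{x}_{ij}, \quad z_{ij}^k, z_{ji}^k \ge 0, \quad \forall (ij) \in E, \ k.$$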

For $K = 2$, the LP relaxation is exact, so that Eq. (10) learns exact max-margin weights for Markov networks of arbitrary topology. For $K > 2$, the linear relaxation leads to a strengthening of the constraints on $\mathbf{w}$ by potentially adding constraints corresponding to fractional assignments $\mathbf{y}$. Thus, the optimal choice $\mathbf{w}, \xi$ for the original QP may no longer be feasible, leading to a different choice of weights. However, as our experiments show, these weights tend to do well in practice.

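The dual of Eq. (10) provides some insight into the structure of the problem:

$$\max \ \sum_{i=1}^{N} \sum_{k=1}^{K} (1 - \hat{y}_i^k)\, \mu_i^k \ - \ \frac{1}{2} \sum_{k=1}^{K} \Big\| \sum_{i=1}^{N} \mathbf{x}_i (C \hat{y}_i^k - \mu_i^k) \Big\|^2 \ - \ \frac{1}{2} \sum_{k=1}^{K} \Big\| \boldsymbol{\lambda}_e^k + \sum_{(ij) \in E} \mathbf{x}_{ij} (C \hat{y}_{ij}^k - \mu_{ij}^k) \Big\|^2 \tag{11}$$

$$\text{s.t.} \quad \mu_i^k \ge 0, \ \forall i, k; \qquad \sum_{k} \mu_i^k = C, \ \forall i;$$

$$\mu_{ij}^k \ge 0, \quad \mu_{ij}^k \le \mu_i^k, \quad \mu_{ij}^k \le \mu_j^k, \quad \forall (ij) \in E, \ k;$$

$$\boldsymbol{\lambda}_e \ge 0.$$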

As in the original M3N optimization, the dual variables have an intuitive probabilistic interpretation. In the binary case, the set of the variables $\mu_i^k, \mu_{ij}^k$ corresponds to marginals of a distribution (normalized to $C$) over the possible assignments $\mathbf{y}$. (This assertion follows from taking the dual of the original exponential-size QP in Eq. (3).) Then the constraints that $\mu_{ij}^k \le \mu_i^k$ and $\mu_{ij}^k \le \mu_j^k$ can be explained by the fact that $P(y_i = y_j = k) \le P(y_i = k)$ and $P(y_i = y_j = k) \le P(y_j = k)$ for any distribution $P(\mathbf{y})$. For $K > 2$, the set of the variables $\mu_i^k, \mu_{ij}^k$ may not correspond to a valid distribution.

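The primal and dual solution are related by:

$$\mathbf{w}_n^k = \sum_{i=1}^{N} \mathbf{x}_i (C \hat{y}_i^k - \mu_i^k), \tag{12}$$

$$\mathbf{w}_e^k = \boldsymbol{\lambda}_e^k + \sum_{(ij) \in E} \mathbf{x}_{ij} (C \hat{y}_{ij}^k - \mu_{ij}^k). \tag{13}$$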

One important consequence of these relationships is that the node parameters are all support vector expansions. Thus, the terms in the constraints of the form $\mathbf{w}_n \mathbf{x}$ can all be expanded in terms of dot products $\mathbf{x}_i^\top \mathbf{x}_j$; the objective ($\|\mathbf{w}\|^2$) can be expanded similarly. Therefore, we can use kernels $K(\mathbf{x}_i, \mathbf{x}_j)$ to define node parameters. Unfortunately, the positivity constraint on the edge potentials, and the resulting $\boldsymbol{\lambda}_e$ dual variable in the expansion of the edge weight, prevent the edge parameters from being kernelized in a similar way.
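The support vector expansion of the node parameters can be sketched as follows. The dual values `mu`, the training features, and the function names are hypothetical. With a linear kernel, the kernelized score must agree with the score obtained from the explicit weight vector given by the expansion of the node weights in terms of the training features.

```python
def linear_kernel(u, v):
    return sum(a * b for a, b in zip(u, v))

def kernel_node_score(x, k, X_train, y_hat, mu, C, kernel=linear_kernel):
    """w_n^k . x written as sum_i (C * yhat_i^k - mu_i^k) * K(x_i, x)."""
    return sum((C * y_hat[i][k] - mu[i][k]) * kernel(X_train[i], x)
               for i in range(len(X_train)))

# Hypothetical training set with two nodes, K = 2 labels, and C = 1.
X_train = [[1.0, 0.0], [0.0, 2.0]]
y_hat = [[1, 0], [0, 1]]          # true label indicators
mu = [[0.5, 0.5], [1.0, 0.0]]     # non-negative, each row sums to C
C = 1.0
x_new = [3.0, 1.0]

# Explicit weight vector for label 0, for comparison with the kernel form.
w0 = [sum((C * y_hat[i][0] - mu[i][0]) * X_train[i][d] for i in range(2))
      for d in range(2)]

print(kernel_node_score(x_new, 0, X_train, y_hat, mu, C), linear_kernel(w0, x_new))
```

Swapping `linear_kernel` for another positive semi-definite kernel changes the node scores without ever forming the weight vector explicitly.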
Figure 2. (a) Comparison of test error of SVMs and AMNs on four categories of Reuters articles (Crude, Money-fx, Grain, Trade, and their average), averaged over 7 folds; (b) Comparison of test error of SVMs, RMNs and AMNs on four WebKB sites.


6.  Experimental  Results 

We evaluated our approach on two text classification domains, of very different structure.

Reuters. We ran our method on the ModApte set of the Reuters-21578 corpus. We selected four categories containing a substantial number of documents: crude, grain, trade, and money-fx. We eliminated documents labeled with more than one category, and represented each document as a bag of words. The resulting dataset contained around 2200 news articles, which were split into seven folds where the articles in each fold occur in the same time period. The reported results were obtained using seven-fold cross-validation with a training set size of ~200 documents and a test set size of ~2000 documents.

The  baseline  model  is  a  linear  kernel  SVM  using  a 
bag  of  words  as  features.  Since  we  train  and  test  on 
articles  in  different  time  periods,  there  is  an  inherent 
distribution  drift  between  our  training  and  test  sets, 
which  hurts  the  SVM’s  performance.  For  example, 
there  may  be  words  which,  in  the  test  set,  are  highly 
indicative  of  a  certain  label,  but  are  not  present  in  the 
training  set  at  all  since  they  were  very  specific  to  a 
particular  time  period  (see  (Taskar  et  al.,  2003b)). 

Our AMN model uses the text similarity of two articles as an indicator of how likely they are to have the same label. The intuition is that two documents that have similar text are likely to share the same label in any time period, so that adding associative edges between them would result in better classification. Such positive correlations are exactly what AMNs represent. In our model, we linked each document to its two closest documents as measured by TF-IDF weighted cosine distance. The TF-IDF score of a term was computed as $(1 + \log \mathit{tf}) \log \frac{N}{\mathit{df}}$, where $\mathit{tf}$ is the term frequency, $N$ is the number of total documents, and $\mathit{df}$ is the document frequency. The node features were simply the words in the article corresponding to the node. Edge features included the actual TF-IDF weighted cosine distance, as well as the bag of words consisting of the union of the words in the linked documents.
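A minimal sketch of this edge construction is given below. It assumes natural logarithms, whitespace-tokenized documents, and that "closest" means highest cosine similarity between TF-IDF vectors; none of these details are specified in the text, and the function names and toy documents are hypothetical.

```python
import math
from collections import Counter

def tfidf_vector(tokens, doc_freq, n_docs):
    """Sparse TF-IDF vector with weights (1 + log tf) * log(N / df)."""
    tf = Counter(tokens)
    return {t: (1 + math.log(c)) * math.log(n_docs / doc_freq[t]) for t, c in tf.items()}

def cosine(u, v):
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu > 0 and nv > 0 else 0.0

def nearest_neighbor_edges(docs, k=2):
    """Link each document to its k most similar documents; edges are unordered pairs."""
    n = len(docs)
    doc_freq = Counter(t for d in docs for t in set(d))
    vecs = [tfidf_vector(d, doc_freq, n) for d in docs]
    edges = set()
    for i in range(n):
        ranked = sorted((j for j in range(n) if j != i),
                        key=lambda j: cosine(vecs[i], vecs[j]), reverse=True)
        for j in ranked[:k]:
            edges.add((min(i, j), max(i, j)))
    return edges

# Hypothetical toy corpus; k = 1 keeps the example small.
docs = [d.split() for d in ["oil crude barrel oil", "crude oil price",
                            "wheat grain harvest", "grain wheat export"]]
edges = nearest_neighbor_edges(docs, k=1)
print(sorted(edges))
```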


We trained both models (SVM and AMN) to predict one category vs. all remaining categories. Fig. 2(a) shows that the AMN model achieves a 13.5% average error reduction over the baseline SVM, with improvement in every category. Applying a paired t-test comparing the AMN and SVM over the 7 folds in each category (crude, trade, grain, money-fx), we obtained p-values of 0.004897, 0.017026, 0.012836, and 0.000291, respectively. These results indicate that the positive interactions learned by the AMN allow us to correct for some of the distribution drift between the training and test sets.
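For reference, the statistic behind a paired t-test over folds can be sketched as below. The per-fold error values are hypothetical placeholders, not the reported results, and the sketch stops at the t statistic: converting it to a p-value requires the Student t distribution with $n - 1$ degrees of freedom, which is not implemented here.

```python
import math

def paired_t_statistic(a, b):
    """t statistic of the paired differences a_i - b_i (n - 1 degrees of freedom)."""
    d = [x - y for x, y in zip(a, b)]
    n = len(d)
    mean = sum(d) / n
    var = sum((x - mean) ** 2 for x in d) / (n - 1)  # unbiased sample variance
    return mean / math.sqrt(var / n)

# Hypothetical per-fold test errors for two models on the same seven folds.
svm_err = [0.10, 0.12, 0.09, 0.11, 0.13, 0.10, 0.12]
amn_err = [0.09, 0.10, 0.08, 0.10, 0.11, 0.09, 0.10]
t = paired_t_statistic(svm_err, amn_err)
print(t)
```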

Hypertext. We tested AMNs on collective hypertext classification, using the variant of the WebKB dataset (Craven et al., 1998) used by Taskar et al. (2002). This data set contains web pages from four different Computer Science departments: Cornell, Texas, Washington, and Wisconsin. Each page is labeled as one of course, faculty, student, project, other. Our goal in this task is to exploit the additional structured information in hypertext using AMNs.

Our flat model is a multiclass linear-kernel SVM predicting categories based on the text content of the webpage. The words are represented as a bag of words. For the AMN model, we used the fact that a webpage's internal structure can be broken up into disjoint sections. For example, a faculty webpage might have one section that discusses research, with a list of links to relevant research projects, another section with links to student webpages, etc. Intuitively, if we have links to two pages in the same section, they are likely to have the same topic. As AMNs capture precisely this type of positive correlation, we added edges between pages that appear as hyperlinks in the same section of another page. The node features for the AMN model are the same as for the multiclass SVM.

In performing the experiments we train on the pages from three of the schools in the dataset and test on the remaining one. The results, shown in Fig. 2(b), demonstrate a 30% relative reduction in test error as a result of modeling the positive correlation between pages in the AMN model. The improvement is present when testing on each of the schools. We also trained the same AMN model using the RMN approach of Taskar et al. (2002). In this approach, the Markov network is trained to maximize the conditional log-likelihood, using loopy belief propagation (Yedidia et al., 2000) for computing the posterior probabilities needed for optimization. Due to the high connectivity in the network, the algorithm is not exact, and not guaranteed to converge to the true values for the posterior distribution. In our results, RMNs achieve a worse test error than AMNs. We note that the learned AMN weights never produced fractional solutions when used for inference, which suggests that the optimization successfully avoided problematic parameterizations of the network, even in the case of the non-optimal multi-class relaxation.

7.  Conclusion 

In this paper, we provide an algorithm for max-margin training of associative Markov networks, a subclass of Markov networks that allows only positive interactions between related variables. Our approach relies on a linear programming relaxation of the MAP problem, which is the key component in the quadratic program associated with the max-margin formulation. We thus provide a polynomial time algorithm which approximately solves the maximum margin estimation problem for any associative Markov network. Importantly, our method is guaranteed to find the optimal (margin-maximizing) solution for all binary-valued AMNs, regardless of the clique size or the connectivity. To our knowledge, this algorithm is the first to provide an effective learning procedure for Markov networks of such general structure.

Our results in the binary case rely on the fact that the LP relaxation of the MAP problem provides exact solutions. In the non-binary case, we are not guaranteed exact solutions, but we can prove constant-factor approximation bounds on the MAP solution returned by the relaxed LP. It would be interesting to see whether these bounds provide us with guarantees on the quality (e.g., the margin) of our learned model.

The class of associative Markov networks appears to cover a large number of interesting applications. We have explored only two such applications in our experimental results, both in the text domain. It would be very interesting to consider other applications, such as image segmentation, extracting protein complexes from protein-protein interaction data, or predicting links in relational data.

However, despite the prevalence of fully associative Markov networks, it is clear that many applications call for repulsive potentials. For example, the best classification accuracy on the WebKB hypertext data set is obtained in a maximum margin framework (Taskar et al., 2003a), when we allow repulsive potentials on linked webpages (representing, for example, that students tend not to link to pages of students). While clearly we cannot introduce fully general potentials into AMNs without running against the NP-hardness of the general problem, it would be interesting to see whether we can extend the class of networks we can learn effectively.
A. Binary AMNs

Proof (For Theorem 3.1) Consider any fractional, feasible $\mathbf{y}$. We show that we can construct a new feasible assignment $\mathbf{z}$ which increases the objective (or leaves it unchanged) and furthermore has fewer fractional entries.

Since $\theta_c^k \ge 0$, we can assume that $y_c^k = \min_{i \in c} y_i^k$; otherwise we could increase the objective by increasing $y_c^k$. We construct an assignment $\mathbf{z}$ from $\mathbf{y}$ by leaving integral values unchanged and uniformly shifting fractional values by $\lambda$:

$$z_i^1 = y_i^1 - \lambda I(0 < y_i^1 < 1), \qquad z_i^2 = y_i^2 + \lambda I(0 < y_i^2 < 1),$$

$$z_c^1 = y_c^1 - \lambda I(0 < y_c^1 < 1), \qquad z_c^2 = y_c^2 + \lambda I(0 < y_c^2 < 1),$$

where $I(\cdot)$ is an indicator function.

Now consider $\lambda^k = \min_{i : y_i^k > 0} y_i^k$. Note that if $\lambda = \lambda^1$ or $\lambda = -\lambda^2$, $\mathbf{z}$ will have at least one more integral $z_i^k$ than $\mathbf{y}$. Thus if we can show that the update results in a feasible and better scoring assignment, we can apply it repeatedly to get an optimal integer solution. To show that $\mathbf{z}$ is feasible, we need $z_i^1 + z_i^2 = 1$, $z_i^k \ge 0$ and $z_c^k = \min_{i \in c} z_i^k$.

First, we show that $z_i^1 + z_i^2 = 1$.

$$z_i^1 + z_i^2 = y_i^1 - \lambda I(0 < y_i^1 < 1) + y_i^2 + \lambda I(0 < y_i^2 < 1) = y_i^1 + y_i^2 = 1.$$

Above we used the fact that if $y_i^1$ is fractional, so is $y_i^2$, since $y_i^1 + y_i^2 = 1$.

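To show that $z_i^k \ge 0$, we prove $\min_i z_i^k = 0$.

$$\min_i z_i^k = \min_i \Big[ y_i^k - \Big( \min_{i : y_i^k > 0} y_i^k \Big) I(0 < y_i^k < 1) \Big] = \min\Big( \min_i y_i^k, \ \min_{i : y_i^k > 0} \Big[ y_i^k - \min_{i : y_i^k > 0} y_i^k \Big] \Big) = 0.$$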


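Lastly, we show $z_c^k = \min_{i \in c} z_i^k$.

$$z_c^1 = y_c^1 - \lambda I(0 < y_c^1 < 1) = \Big( \min_{i \in c} y_i^1 \Big) - \lambda I\Big(0 < \min_{i \in c} y_i^1 < 1\Big) = \min_{i \in c} z_i^1;$$

$$z_c^2 = y_c^2 + \lambda I(0 < y_c^2 < 1) = \Big( \min_{i \in c} y_i^2 \Big) + \lambda I\Big(0 < \min_{i \in c} y_i^2 < 1\Big) = \min_{i \in c} z_i^2.$$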


We have established that the new $\mathbf{z}$ are feasible, and it remains to show that we can improve the objective. We can show that the change in the objective is always $\lambda D$ for some constant $D$ that depends only on $\mathbf{y}$ and $\theta$. This implies that one of the two cases, $\lambda = \lambda^1$ or $\lambda = -\lambda^2$, will necessarily increase the objective (or leave it unchanged). The change in the objective is:

$$\sum_{i=1}^{N} \sum_{k=1,2} \theta_i^k (z_i^k - y_i^k) + \sum_{c \in C} \sum_{k=1,2} \theta_c^k (z_c^k - y_c^k) = \lambda \Big[ \sum_{i=1}^{N} (D_i^2 - D_i^1) + \sum_{c \in C} (D_c^2 - D_c^1) \Big] = \lambda D,$$

$$D_i^k = \theta_i^k I(0 < y_i^k < 1), \qquad D_c^k = \theta_c^k I(0 < y_c^k < 1).$$

Hence the new assignment $\mathbf{z}$ is feasible, does not decrease the objective function, and has strictly fewer fractional entries. ∎


B.  Multi-class  AMNs 

For K > 2, we use the randomized rounding procedure of Kleinberg and
Tardos (1999) to produce an integer solution for the linear relaxation,
losing at most a factor of $m = \max_{c \in C} |c|$ in the objective function.
The basic idea of the rounding procedure is to treat $y_i^k$ as probabilities
and assign labels according to these probabilities in phases. In each phase,
we pick a label k uniformly at random and a threshold $\alpha \in [0, 1]$
uniformly at random. For each node i that has not yet been assigned a label,
we assign the label k if $y_i^k \ge \alpha$. The procedure terminates when all
nodes have been assigned a label. Our analysis closely follows that of
Kleinberg and Tardos (1999).
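
The phases described above can be sketched in a few lines. This is a minimal illustration rather than the authors' implementation; the function name `randomized_rounding` and the list-of-rows layout `y[i][k]` are assumptions made for the example.

```python
import random


def randomized_rounding(y, rng=None):
    """Round fractional LP values y[i][k] to one integer label per node.

    Each row y[i] is assumed to be non-negative and to sum to 1.
    """
    rng = rng or random.Random()
    num_nodes, num_labels = len(y), len(y[0])
    labels = [None] * num_nodes
    unassigned = set(range(num_nodes))
    while unassigned:
        k = rng.randrange(num_labels)  # label picked uniformly at random
        alpha = rng.random()           # threshold, uniform on [0, 1)
        # With alpha drawn from the half-open interval [0, 1), the strict
        # test y > alpha accepts with probability exactly y[i][k],
        # including the endpoints 0 and 1.
        for i in [i for i in unassigned if y[i][k] > alpha]:
            labels[i] = k
            unassigned.discard(i)
    return labels
```

Because every phase uses one shared label and one shared threshold for all unassigned nodes, nodes with similar fractional values tend to receive the same label, which is what the clique-level analysis relies on.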

Lemma B.1 The probability that a node i is assigned
label k by the randomized procedure is $y_i^k$.

Proof The probability that an unassigned node is assigned label k during
one phase is $\frac{1}{K} y_i^k$, which is proportional to $y_i^k$. By symmetry,
the probability that a node is assigned label k over all phases is exactly
$y_i^k$. ∎

Lemma B.2 The probability that all nodes in a clique
c are assigned label k by the procedure is at least $\frac{1}{|c|} y_c^k$.

Proof For a single phase, the probability that all nodes in a clique c are
assigned label k if none of the nodes were previously assigned is
$\frac{1}{K} \min_{i \in c} y_i^k = \frac{1}{K} y_c^k$. The probability that at least
one of the nodes will be assigned label k in a phase is
$\frac{1}{K} \max_{i \in c} y_i^k$. The probability that none of the nodes in the
clique will be assigned any label in one phase is
$1 - \frac{1}{K} \sum_{k=1}^{K} \max_{i \in c} y_i^k$.
Nodes in the clique c will be assigned label k by the procedure if they are
assigned label k in one phase. (They can also be assigned label k as a
result of several phases, but we can ignore this possibility for the
purposes of the lower bound.) The probability that all the nodes in c will
be assigned label k by the procedure in a single phase is:

$$\sum_{j=1}^{\infty} \frac{1}{K} y_c^k \Big( 1 - \frac{1}{K} \sum_{k'=1}^{K} \max_{i \in c} y_i^{k'} \Big)^{j-1}
= \frac{y_c^k}{\sum_{k'=1}^{K} \max_{i \in c} y_i^{k'}}
\ge \frac{y_c^k}{\sum_{k'=1}^{K} \sum_{i \in c} y_i^{k'}}
= \frac{y_c^k}{|c|}.$$

Above, we first used the fact that for $d < 1$,
$\sum_{i=0}^{\infty} d^i = \frac{1}{1-d}$, and then upper-bounded the max of
the set of positive $y_i^k$'s by their sum. ∎


Theorem B.3 The expected cost of the assignment found by the randomized
procedure given a solution y to the linear program in Eq. (2) is at least

$$\sum_{i=1}^{N} \sum_{k=1}^{K} \theta_i^k y_i^k + \sum_{c \in C} \frac{1}{|c|} \sum_{k=1}^{K} \theta_c^k y_c^k.$$

Proof This is immediate from the previous two lemmas.

The only difference between the expected cost of the rounded solution and
the (non-integer) optimal solution is the $\frac{1}{|c|}$ factor in the second
term. By picking $m = \max_{c \in C} |c|$, we have that the rounded solution
is at most m times worse than the optimal solution produced by the LP of
Eq. (2). ∎


We can also derandomize this procedure to get a deterministic algorithm
with the same guarantees, using the method of conditional probabilities,
similar in spirit to the approach of Kleinberg and Tardos (1999).

Note that the approximation factor of m applies, in fact, only to the
clique potentials. Thus, if we compare the log-probability of the optimal
MAP solution and the log-probability of the assignment produced by this
randomized rounding procedure, the terms corresponding to the
log-partition-function and the node potentials are identical. We obtain an
additive error (in log-probability space) only for the clique potentials.
As node potentials are often larger in magnitude than clique potentials,
the fact that we incur no loss proportional to node potentials is likely to
lead to smaller errors in practice. Along similar lines, we note that the
constant factor approximation is smaller for smaller cliques; again, we
observe that the potentials associated with large cliques are typically
smaller in magnitude, further reducing the actual error in practice.

Abstract

In many supervised learning tasks, the entities to be labeled are related
to each other in complex ways and their labels are not independent. For
example, in hypertext classification, the labels of linked pages are highly
correlated. A standard approach is to classify each entity independently,
ignoring the correlations between them. Recently, Probabilistic Relational
Models, a relational version of Bayesian networks, were used to define a
joint probabilistic model for a collection of related entities. In this
paper, we present an alternative framework that builds on (conditional)
Markov networks and addresses two limitations of the previous approach.
First, undirected models do not impose the acyclicity constraint that
hinders representation of many important relational dependencies in
directed models. Second, undirected models are well suited for
discriminative training, where we optimize the conditional likelihood of
the labels given the features, which generally improves classification
accuracy. We show how to train these models effectively, and how to use
approximate probabilistic inference over the learned model for collective
classification of multiple related entities. We provide experimental
results on a webpage classification task, showing that accuracy can be
significantly improved by modeling relational dependencies.

1  Introduction 

The vast majority of work in statistical classification methods has
focused on “flat” data: data consisting of identically-structured entities,
typically assumed to be independent and identically distributed (IID).
However, many real-world data sets are innately relational: hyperlinked
webpages, cross-citations in patents and scientific papers, social
networks, medical records, and more. Such data consist of entities of
different types, where each entity type is characterized by a different set
of attributes. Entities are related to each other via different types of
links, and the link structure is an important source of information.


Consider a collection of hypertext documents that we want to classify
using some set of labels. Most naively, we can use a bag of words model,
classifying each webpage solely using the words that appear on the page.
However, hypertext has a very rich structure that this approach loses
entirely. One document has hyperlinks to others, typically indicating that
their topics are related. Each document also has internal structure, such
as a partition into sections; hyperlinks that emanate from the same section
of the document are even more likely to point to similar documents. When
classifying a collection of documents, these are important cues that can
potentially help us achieve better classification accuracy. Therefore,
rather than classifying each document separately, we want to provide a form
of collective classification, where we simultaneously decide on the class
labels of all of the entities together, and thereby can explicitly take
advantage of the correlations between the labels of related entities.

We propose the use of a joint probabilistic model for an entire collection
of related entities. Following the approach of Lafferty et al. (2001), we
base our approach on discriminatively trained undirected graphical models,
or Markov networks (Pearl 1988). We introduce the framework of relational
Markov networks (RMNs), which compactly define a Markov network over a
relational data set. The graphical structure of an RMN is based on the
relational structure of the domain, and can easily model complex patterns
over related entities. For example, we can represent a pattern where two
linked documents are likely to have the same topic. We can also capture
patterns that involve groups of links: for example, consecutive links in a
document tend to refer to documents with the same label. As we show, the
use of an undirected graphical model avoids the difficulties of defining a
coherent generative model for graph structures in directed models. It
thereby allows us tremendous flexibility in representing complex patterns.

Undirected models lend themselves well to discriminative training, where
we optimize the conditional likelihood of the labels given the features.
Discriminative training, given sufficient data, generally provides
significant improvements in classification accuracy over generative
training (Vapnik 1995). We provide an effective parameter estimation
algorithm for RMNs which uses conjugate gradient combined with approximate
probabilistic inference (belief propagation (Pearl 1988)) for estimating
the gradient. We also show how to use approximate probabilistic inference
over the learned model for collective classification of multiple related
entities. We provide experimental results on a webpage classification task,
showing significant gains in accuracy arising both from the modeling of
relational dependencies and the use of discriminative training.

2  Relational  Classification 

Consider hypertext as a simple example of a relational domain. A relational
domain is defined by a schema, which describes entities, their attributes
and relations between them. In our domain, there are two entity types: Doc
and Link. If a webpage is represented as a bag of words, Doc would have a
set of boolean attributes Doc.HasWord_k indicating whether the word k
occurs on the page. It would also have the label attribute Doc.Label,
indicating the topic of the page, which takes on a set of categorical
values. The Link entity type has two attributes: Link.From and Link.To,
both of which refer to Doc entities.

In general, a schema specifies a set of entity types
$\mathcal{E} = \{E_1, \ldots, E_n\}$. Each type E is associated with three sets
of attributes: content attributes E.X (e.g. Doc.HasWord_k), label
attributes E.Y (e.g. Doc.Label), and reference attributes E.R (e.g.
Link.To). For simplicity, we restrict label and content attributes to take
on categorical values. Reference attributes include a special unique key
attribute E.K that identifies each entity. Other reference attributes E.R
refer to entities of a single type E' = Range(E.R) and take values in
Domain(E'.K).

An instantiation $\mathcal{I}$ of a schema $\mathcal{E}$ specifies the set of
entities $\mathcal{I}(E)$ of each entity type $E \in \mathcal{E}$ and the values of
all attributes for all of the entities. For example, an instantiation of
the hypertext schema is a collection of webpages, specifying their labels,
words they contain and links between them. We will use $\mathcal{I}.X$,
$\mathcal{I}.Y$ and $\mathcal{I}.R$ to denote the content, label and reference
attributes in the instantiation $\mathcal{I}$, and $\mathcal{I}.x$, $\mathcal{I}.y$ and
$\mathcal{I}.r$ to denote the values of those attributes. The component
$\mathcal{I}.r$, which we call an instantiation skeleton or instantiation graph,
specifies the set of entities (nodes) and their reference attributes
(edges). A hypertext instantiation graph specifies a set of webpages and
links between them, but not their words or labels.
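
The hypertext schema and its instantiation graph can be written down as plain data structures. The sketch below is only an illustration of the definitions above: the class names, field names, and the `skeleton` method are hypothetical, and for brevity the skeleton is reduced to document keys plus the From/To edges.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Set, Tuple


@dataclass
class Doc:
    key: str                     # unique key attribute Doc.K
    has_word: Dict[str, bool]    # content attributes Doc.HasWord_k
    label: Optional[str] = None  # label attribute Doc.Label (None if unobserved)


@dataclass
class Link:
    key: str       # unique key attribute Link.K
    from_doc: str  # reference attribute Link.From (a Doc key)
    to_doc: str    # reference attribute Link.To (a Doc key)


@dataclass
class Instantiation:
    docs: List[Doc]
    links: List[Link]

    def skeleton(self) -> Tuple[Set[str], Set[Tuple[str, str]]]:
        """Return the instantiation graph: entity keys and reference edges only."""
        nodes = {doc.key for doc in self.docs}
        edges = {(link.from_doc, link.to_doc) for link in self.links}
        return nodes, edges
```

Note that `skeleton` deliberately drops words and labels, mirroring the statement that the instantiation graph carries only entities and references.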

The structure of the instantiation graph has been used extensively to
infer the importance of documents in scientific publications (Egghe and
Rousseau 1990) and hypertext (Kleinberg 1999). Several recent papers have
proposed algorithms that use the link graph to aid classification.
Chakrabarti et al. (1998) use system-predicted labels of linked documents
to iteratively re-label each document in the test set, achieving a
significant improvement compared to a baseline of using the text in each
document alone. A similar approach was used by Neville and Jensen (2000) in
a different domain. Slattery and Mitchell (2000) tried to identify
directory (or hub) pages that commonly list pages of the same topic, and
used these pages to improve classification of university webpages. However,
none of these approaches provide a coherent model for the correlations
between linked webpages. Thus, they apply combinations of classifiers in a
procedural way, with no formal justification.

Taskar et al. (2001) suggest the use of probabilistic relational models
(PRMs) for the collective classification task. PRMs (Koller and Pfeffer
1998; Friedman et al. 1999) are a relational extension to Bayesian networks
(Pearl 1988). A PRM specifies a probability distribution over
instantiations consistent with a given instantiation graph by specifying a
Bayesian-network-like template-level probabilistic model for each entity
type. Given a particular instantiation graph, the PRM induces a large
Bayesian network over that instantiation that specifies a joint probability
distribution over all attributes of all of the entities. This network
reflects the interactions between related instances by allowing us to
represent correlations between their attributes.

In our hypertext example, a PRM might use a naive Bayes model for words,
with a directed edge between Doc.Label and each attribute Doc.HasWord_k;
each of these attributes would have a conditional probability distribution
P(Doc.HasWord_k | Doc.Label) associated with it, indicating the probability
that word k appears in the document given each of the possible topic
labels. More importantly, a PRM can represent the inter-dependencies
between topics of linked documents by introducing an edge from Doc.Label to
Doc.Label of two documents if there is a link between them. Given a
particular instantiation graph containing some set of documents and links,
the PRM specifies a Bayesian network over all of the documents in the
collection. We would have a probabilistic dependency from each document's
label to the words on the document, and a dependency from each document's
label to the labels of all of the documents to which it points. Taskar et
al. show that this approach works well for classifying scientific
documents, using both the words in the title and abstract and the
citation-link structure.

However, the application of this idea to other domains, such as webpages,
is problematic since there are many cycles in the link graph, leading to
cycles in the induced “Bayesian network”, which is therefore not a coherent
probabilistic model. Getoor et al. (2001) suggest an approach where we do
not include direct dependencies between the labels of linked webpages, but
rather treat links themselves as random variables. Every pair of pages has
a “potential link”, which may or may not exist in the data. The model
defines the probability of the link existence as a function of the labels
of the two endpoints. In this link existence model, labels have no incoming
edges from other labels, and the cyclicity problem disappears. This model,
however, has other fundamental limitations. In particular, the resulting
Bayesian network has a random variable for each potential link, that is,
$N^2$ variables for collections containing N pages. This quadratic blowup
occurs even when the actual link graph is very sparse. When N is large
(e.g., the set of all webpages), a quadratic growth is intractable. Even
more problematic are the inherent limitations on the expressive power
imposed by the constraint that the directed graph must represent a coherent
generative model over graph structures. The link existence model assumes
that the presence of each edge is a conditionally independent event.
Representing more complex patterns involving correlations between multiple
edges is very difficult. For example, if two pages point to the same page,
it is more likely that they point to each other as well. Such interactions
between many overlapping triples of links do not fit well into the
generative framework.

Furthermore, directed models such as Bayesian networks and PRMs are
usually trained to optimize the joint probability of the labels and other
attributes, while the goal of classification is a discriminative model of
labels given the other attributes. The advantage of training a model only
to discriminate between labels is that it does not have to trade off
between classification accuracy and modeling the joint distribution over
non-label attributes. In many cases, discriminatively trained models are
more robust to violations of independence assumptions and achieve higher
classification accuracy than their generative counterparts.

3  Undirected  Models  for  Classification 

As discussed, our approach to the collective classification task is based
on the use of undirected graphical models. We begin by reviewing Markov
networks, a “flat” undirected model. We then discuss how Markov networks
can be extended to the relational setting.

Markov networks. We use V to denote a set of discrete random variables and
v an assignment of values to V. A Markov network for V defines a joint
distribution over V. It consists of a qualitative component, an undirected
dependency graph, and a quantitative component, a set of parameters
associated with the graph. For a graph G, a clique is a set of nodes $V_c$
in G, not necessarily maximal, such that each $V_i, V_j \in V_c$ are connected
by an edge in G. Note that a single node is also considered a clique.

Definition 1: Let G = (V, E) be an undirected graph with a set of cliques
C(G). Each $c \in C(G)$ is associated with a set of nodes $V_c$ and a clique
potential $\phi_c(V_c)$, which is a non-negative function defined on the
joint domain of $V_c$. Let $\Phi = \{\phi_c(V_c)\}_{c \in C(G)}$. The Markov net
$(G, \Phi)$ defines the distribution
$P(\mathbf{v}) = \frac{1}{Z} \prod_{c \in C(G)} \phi_c(\mathbf{v}_c)$, where Z is the
partition function, a normalization constant given by

$$Z = \sum_{\mathbf{v}'} \prod_{c \in C(G)} \phi_c(\mathbf{v}'_c).$$

Figure 1: An unrolled Markov net over linked documents. The links follow a
common pattern: documents with the same label tend to link to each other
more often.

Each potential $\phi_c$ is simply a table of values for each assignment
$\mathbf{v}_c$ that defines a “compatibility” between values of variables in the
clique. The potential is often represented by a log-linear combination of a
small set of indicator functions, or features, of the form
$f(V_c) \equiv \delta(V_c = \mathbf{v}_c)$. In this case, the potential can be more
conveniently represented in log-linear form:

$$\phi_c(\mathbf{v}_c) = \exp\Big\{ \sum_i w_i f_i(\mathbf{v}_c) \Big\} = \exp\{ \mathbf{w}_c \cdot \mathbf{f}_c(\mathbf{v}_c) \}.$$

Hence we can write:

$$\log P(\mathbf{v}) = \sum_c \mathbf{w}_c \cdot \mathbf{f}_c(\mathbf{v}_c) - \log Z = \mathbf{w} \cdot \mathbf{f}(\mathbf{v}) - \log Z,$$

where w and f are the vectors of all weights and features.
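
For a network small enough to enumerate, the definition can be evaluated directly. The sketch below is a brute-force illustration, assuming potentials are supplied as Python callables; `joint_distribution` and `agreement_potential` are hypothetical names introduced for the example, and the approach is only viable for a handful of variables.

```python
import itertools
import math


def joint_distribution(domains, potentials):
    """Return (P, Z) for a Markov network by exhaustive enumeration.

    domains: list of value lists, one per variable.
    potentials: list of (scope, phi) pairs, where scope is a tuple of
        variable indices and phi maps their values to a non-negative number.
    """
    assignments = list(itertools.product(*domains))
    scores = [
        math.prod(phi(*(v[i] for i in scope)) for scope, phi in potentials)
        for v in assignments
    ]
    z = sum(scores)  # partition function Z
    return {v: s / z for v, s in zip(assignments, scores)}, z


def agreement_potential(weight):
    """Log-linear potential exp{w * delta(a = b)} on a pair of variables."""
    return lambda a, b: math.exp(weight * (a == b))
```

For a chain of three binary variables with the same agreement potential on both edges, the partition function has the closed form $Z = 2(e^w + 1)^2$, which gives a convenient sanity check on the enumeration.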

For classification, we are interested in constructing discriminative
models using conditional Markov nets, which are simply Markov networks
renormalized to model a conditional distribution.

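Definition 2: Let X be a set of random variables on which we condition and
Y be a set of target (or label) random variables. A conditional Markov
network is a Markov network $(G, \Phi)$ which defines the distribution
$P(\mathbf{y} \mid \mathbf{x}) = \frac{1}{Z(\mathbf{x})} \prod_{c \in C(G)} \phi_c(\mathbf{x}_c, \mathbf{y}_c)$,
where $Z(\mathbf{x})$ is the partition function, now dependent on $\mathbf{x}$:
$Z(\mathbf{x}) = \sum_{\mathbf{y}'} \prod_{c \in C(G)} \phi_c(\mathbf{x}_c, \mathbf{y}'_c)$.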

Logistic regression, a well-studied statistical model for classification,
can be viewed as the simplest example of a conditional Markov network. In
standard form, for $Y = \pm 1$ and $\mathbf{X} \in \{0, 1\}^n$ (or
$\mathbf{X} \in \mathbb{R}^n$), $P(y \mid \mathbf{x}) = \frac{1}{Z(\mathbf{x})} \exp\{ y \mathbf{w} \cdot \mathbf{x} \}$.
Viewing the model as a Markov network, the cliques are simply the edges
$c_k = \{X_k, Y\}$ with potentials $\phi_k(x_k, y) = \exp\{ y w_k x_k \}$.
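
The equivalence is easy to check numerically. The following sketch multiplies the edge potentials and normalizes over $y \in \{-1, +1\}$; the function name `cmn_logistic` is hypothetical. Under this parameterization the result coincides with the logistic function evaluated at $2\,\mathbf{w} \cdot \mathbf{x}$.

```python
import math


def cmn_logistic(w, x, y):
    """P(y | x) for y in {-1, +1} under edge potentials exp{y * w_k * x_k}."""
    score = sum(wk * xk for wk, xk in zip(w, x))
    z = math.exp(score) + math.exp(-score)  # Z(x): sum over y' in {-1, +1}
    return math.exp(y * score) / z
```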

Relational Markov Networks. We now extend the framework of Markov networks
to the relational setting. A relational Markov network (RMN) specifies a
conditional distribution over all of the labels of all of the entities in
an instantiation given the relational structure and the content attributes.
(We provide the definitions directly for the conditional case, as the
unconditional case is a special case where the set of content attributes is
empty.) Roughly speaking, it specifies the cliques and potentials between
attributes of related entities at a template level, so a single model
provides a coherent distribution for any collection of instances from the
schema.

For example, suppose that pages with the same label tend to link to each
other, as in Fig. 1. We can capture this correlation between labels by
introducing, for each link, a clique between the labels of the source and
the target page. The potential on the clique will have higher values for
assignments that give a common label to the linked pages.

To specify what cliques should be constructed in an instantiation, we will
define a notion of a relational clique template. A relational clique
template specifies tuples of variables in the instantiation by using a
relational query language. For our link example, we can write the template
as a kind of SQL query:

    SELECT doc1.Category, doc2.Category
    FROM Doc doc1, Doc doc2, Link link
    WHERE link.From = doc1.Key and link.To = doc2.Key

Note the three clauses that define a query: the FROM clause specifies the
cross product of entities to be filtered by the WHERE clause and the SELECT
clause picks out the attributes of interest. Our definition of clique
templates contains the corresponding three parts.

Definition 3: A relational clique template C = (F, W, S)
consists of three components:

•  $\mathbf{F} = \{F_i\}$: a set of entity variables, where an entity
variable $F_i$ is of type $E(F_i)$.

•  $W(\mathbf{F}.\mathbf{R})$: a boolean formula using conditions of
the form $F_i.R_j = F_k.R_l$.

•  $\mathbf{F}.\mathbf{S} \subseteq \mathbf{F}.\mathbf{X} \cup \mathbf{F}.\mathbf{Y}$: a selected subset of content and
label attributes in $\mathbf{F}$. ∎

For the clique template corresponding to the SQL query above, F consists
of doc1, doc2 and link of types Doc, Doc and Link, respectively. W(F.R) is
link.From = doc1.Key ∧ link.To = doc2.Key and F.S is doc1.Category and
doc2.Category.

A clique template specifies a set of cliques in an instantiation
$\mathcal{I}$:

$$C(\mathcal{I}) = \{ c = \mathbf{f}.\mathbf{S} : \mathbf{f} \in \mathcal{I}(\mathbf{F}) \wedge W(\mathbf{f}.\mathbf{r}) \},$$

where $\mathbf{f}$ is a tuple of entities $\{f_i\}$ in which each $f_i$ is of type
$E(F_i)$; $\mathcal{I}(\mathbf{F}) = \mathcal{I}(E(F_1)) \times \ldots \times \mathcal{I}(E(F_n))$
denotes the cross-product of entities in the instantiation; the clause
$W(\mathbf{f}.\mathbf{r})$ ensures that the entities are related to each other in
specified ways; and finally, $\mathbf{f}.\mathbf{S}$ selects the appropriate attributes
of the entities. Note that the clique template does not specify the nature
of the interaction between the attributes; that is determined by the clique
potentials, which will be associated with the template.
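
The set $C(\mathcal{I})$ can be computed literally as a filtered cross product. The sketch below is a deliberately naive illustration of the definition, not an efficient query engine; `instantiate_template`, the dictionary-based entities, and the three-page example are assumptions made for the example.

```python
import itertools


def instantiate_template(entity_sets, where, select):
    """Return C(I) = {f.S : f in I(F) and W(f.r)} for one clique template.

    entity_sets: one list of entities per entity variable F_i.
    where: predicate over a tuple of entities (the WHERE clause).
    select: maps a satisfying tuple to the selected variables (the SELECT clause).
    """
    return [
        select(*f)
        for f in itertools.product(*entity_sets)  # cross product (FROM clause)
        if where(*f)
    ]


# Hypothetical three-page instantiation graph with three links.
docs = [{"Key": "a"}, {"Key": "b"}, {"Key": "c"}]
links = [{"From": "a", "To": "b"}, {"From": "a", "To": "c"}, {"From": "c", "To": "a"}]

# The link template: one clique over the two Category variables per link.
link_cliques = instantiate_template(
    [docs, docs, links],
    where=lambda d1, d2, ln: ln["From"] == d1["Key"] and ln["To"] == d2["Key"],
    select=lambda d1, d2, ln: ((d1["Key"], "Category"), (d2["Key"], "Category")),
)
```

The enumeration costs the full size of the cross product; a practical system would instead follow the reference attributes (here, iterate over links) to reach the same set of cliques.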

This definition of a clique template is very flexible, as the WHERE clause
of a template can be an arbitrary predicate. It allows modeling complex
relational patterns on the instantiation graphs. To continue our webpage
example, consider another common pattern in hypertext: links in a webpage
tend to point to pages of the same category. This pattern can be expressed
by the following template:

    SELECT doc1.Category, doc2.Category
    FROM Doc doc1, Doc doc2, Link link1, Link link2
    WHERE link1.From = link2.From and link1.To = doc1.Key
    and link2.To = doc2.Key and not doc1.Key = doc2.Key


Depending on the expressive power of our template definition language, we
may be able to construct very complex templates that select entire subgraph
structures of an instantiation. We can easily represent patterns involving
three (or more) interconnected documents without worrying about the
acyclicity constraint imposed by directed models. Since the clique
templates do not explicitly depend on the identities of entities, the same
template can select subgraphs whose structure is fairly different. The RMN
allows us to associate the same clique potential parameters with all of the
subgraphs satisfying the template, thereby allowing generalization over a
wide range of different structures.

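Definition 4: A relational Markov network (RMN) $\mathcal{M} = (\mathbf{C}, \Phi)$
specifies a set of clique templates $\mathbf{C}$ and corresponding potentials
$\Phi = \{\phi_C\}_{C \in \mathbf{C}}$ to define a conditional distribution:

$$P(\mathcal{I}.\mathbf{y} \mid \mathcal{I}.\mathbf{x}, \mathcal{I}.\mathbf{r})
= \frac{1}{Z(\mathcal{I}.\mathbf{x}, \mathcal{I}.\mathbf{r})} \prod_{C \in \mathbf{C}} \prod_{c \in C(\mathcal{I})} \phi_C(\mathcal{I}.\mathbf{x}_c, \mathcal{I}.\mathbf{y}_c),$$

where $Z(\mathcal{I}.\mathbf{x}, \mathcal{I}.\mathbf{r})$ is the normalizing partition function:

$$Z(\mathcal{I}.\mathbf{x}, \mathcal{I}.\mathbf{r}) = \sum_{\mathcal{I}.\mathbf{y}'} \prod_{C \in \mathbf{C}} \prod_{c \in C(\mathcal{I})} \phi_C(\mathcal{I}.\mathbf{x}_c, \mathcal{I}.\mathbf{y}'_c).$$ ∎

Using the log-linear representation of potentials,
$\phi_C(\mathbf{V}_C) = \exp\{ \mathbf{w}_C \cdot \mathbf{f}_C(\mathbf{V}_C) \}$, we can write

$$\begin{aligned}
\log P(\mathcal{I}.\mathbf{y} \mid \mathcal{I}.\mathbf{x}, \mathcal{I}.\mathbf{r})
&= \sum_{C \in \mathbf{C}} \sum_{c \in C(\mathcal{I})} \mathbf{w}_C \cdot \mathbf{f}_C(\mathcal{I}.\mathbf{x}_c, \mathcal{I}.\mathbf{y}_c) - \log Z(\mathcal{I}.\mathbf{x}, \mathcal{I}.\mathbf{r}) \\
&= \sum_{C \in \mathbf{C}} \mathbf{w}_C \cdot \mathbf{f}_C(\mathcal{I}.\mathbf{x}, \mathcal{I}.\mathbf{y}, \mathcal{I}.\mathbf{r}) - \log Z(\mathcal{I}.\mathbf{x}, \mathcal{I}.\mathbf{r}) \\
&= \mathbf{w} \cdot \mathbf{f}(\mathcal{I}.\mathbf{x}, \mathcal{I}.\mathbf{y}, \mathcal{I}.\mathbf{r}) - \log Z(\mathcal{I}.\mathbf{x}, \mathcal{I}.\mathbf{r}),
\end{aligned}$$

where

$$\mathbf{f}_C(\mathcal{I}.\mathbf{x}, \mathcal{I}.\mathbf{y}, \mathcal{I}.\mathbf{r}) = \sum_{c \in C(\mathcal{I})} \mathbf{f}_C(\mathcal{I}.\mathbf{x}_c, \mathcal{I}.\mathbf{y}_c)$$

is the sum over all appearances of the template $C(\mathcal{I})$ in the
instantiation, and $\mathbf{f}$ is the vector of all $\mathbf{f}_C$.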

Given a particular instantiation $\mathcal{I}$ of the schema, the RMN
$\mathcal{M}$ produces an unrolled Markov network over the attributes of
entities in $\mathcal{I}$. The cliques in the unrolled network are determined
by the clique templates $\mathbf{C}$. We have one clique for each
$c \in C(\mathcal{I})$, and all of these cliques are associated with the same
clique potential $\phi_C$. In our webpage example, an RMN with the link
feature described above would define a Markov net in which, for every link
between two pages, there is an edge between the labels of these pages.
Fig. 1 illustrates a simple instance of this unrolled Markov network.
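
Because every clique generated by a template shares that template's weights, the unnormalized log-score depends on the instantiation only through aggregated feature counts. The sketch below illustrates this for two hypothetical templates: a word template with one indicator per (word, label) pair and a link template with a single "same label" indicator. The function names and the feature naming scheme are assumptions made for the example.

```python
from collections import Counter


def aggregated_features(labels, words, links):
    """Sum each template's features over all of its cliques in the unrolled network.

    labels: dict doc key -> label; words: dict doc key -> set of words;
    links: list of (from_key, to_key) pairs.
    """
    counts = Counter()
    for key, label in labels.items():  # one clique per (doc, word) pair
        for word in words[key]:
            counts[("word", word, label)] += 1
    for src, dst in links:             # one clique per link
        if labels[src] == labels[dst]:
            counts[("link", "same_label")] += 1
    return counts


def unnormalized_log_score(weights, counts):
    """w . f(I.x, I.y, I.r); the log-partition term log Z is omitted."""
    return sum(weights.get(name, 0.0) * value for name, value in counts.items())
```

Adding more pages or links changes the counts but not the number of parameters, which is the sense in which one RMN defines a distribution for any instantiation of the schema.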

4  Learning  the  Models 

In this paper, we focus on the case where the clique templates are given;
our task is to estimate the clique potentials, or feature weights. Thus,
assume that we are given a set of clique templates $\mathbf{C}$ which partially
specify our (relational) Markov network, and our task is to compute the
weights $\mathbf{w}$ for the potentials $\Phi$. In the learning task, we are given
some training set D where both the content attributes and the labels are
observed. Any particular setting for $\mathbf{w}$ fully specifies a probability
distribution $P_\mathbf{w}$ over D, so we can use the likelihood as our objective
function, and attempt to find the weight setting that maximizes the
likelihood (ML) of the labels given other attributes. However, to help
avoid overfitting, we assume a “shrinkage” prior over the weights (a
zero-mean Gaussian), and use maximum a posteriori (MAP) estimation. More
precisely, we assume that different parameters are a priori independent and
define $p(w_i) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\{ -w_i^2 / 2\sigma^2 \}$.

Both the ML and MAP objective functions are concave and there are many
methods available for maximizing them. Our experience is that conjugate
gradient and even simple gradient perform very well for logistic regression
(Minka 2000) and relational Markov nets.

Learning Markov Networks. We first consider discriminative MAP training in
the flat setting. In this case D is simply a set of IID instances; let d
index over all labeled training data D. The discriminative likelihood of
the data is $\prod_d P_\mathbf{w}(\mathbf{y}_d \mid \mathbf{x}_d)$. We introduce the parameter
prior, and maximize the log of the resulting MAP objective function:

$$L(\mathbf{w}, D) = \sum_{d \in D} \big( \mathbf{w} \cdot \mathbf{f}(\mathbf{x}_d, \mathbf{y}_d) - \log Z(\mathbf{x}_d) \big) - \frac{\|\mathbf{w}\|_2^2}{2\sigma^2} + C.$$

The gradient of the objective function is computed as:

$$\nabla L(\mathbf{w}, D) = \sum_{d \in D} \big( \mathbf{f}(\mathbf{x}_d, \mathbf{y}_d) - E_{P_\mathbf{w}}[\mathbf{f}(\mathbf{x}_d, \mathbf{Y}_d)] \big) - \frac{\mathbf{w}}{\sigma^2}.$$

The last term is the shrinking effect of the prior and the other two terms
are the difference between the empirical feature counts and the expected
feature counts, where the expectation is taken relative to $P_\mathbf{w}$:

$$E_{P_\mathbf{w}}[\mathbf{f}(\mathbf{x}_d, \mathbf{Y}_d)] = \sum_{\mathbf{y}'_d} \mathbf{f}(\mathbf{x}_d, \mathbf{y}'_d) P_\mathbf{w}(\mathbf{y}'_d \mid \mathbf{x}_d).$$

Thus, ignoring the effect of the prior, the gradient is zero when empirical
and expected feature counts are equal (see footnote 1). The prior term
gives the smoothing we expect from the prior: small weights are preferred
in order to reduce overfitting. Note that the sum over $\mathbf{y}'$ is just over
the possible categorizations for one data sample every time.
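
In the flat setting the expectation is a sum over the labels of one instance at a time, so the gradient can be computed exactly. The sketch below assumes a simple multiclass log-linear classifier with features $x_j \, \delta(y = k)$; this feature choice, the nested-list weight layout `w[k][j]`, and the function names are assumptions made for illustration only.

```python
import math


def log_posterior(w, data, num_labels, sigma2):
    """MAP objective (up to the constant C) for a flat log-linear classifier.

    w[k][j] is the weight of feature x_j * delta(y = k); data is a list of (x, y).
    """
    total = 0.0
    for x, y in data:
        scores = [sum(wkj * xj for wkj, xj in zip(w[k], x)) for k in range(num_labels)]
        log_z = math.log(sum(math.exp(s) for s in scores))
        total += scores[y] - log_z
    penalty = sum(wkj ** 2 for row in w for wkj in row) / (2 * sigma2)
    return total - penalty


def gradient(w, data, num_labels, sigma2):
    """Empirical minus expected feature counts, minus the prior term w / sigma^2."""
    grad = [[-wkj / sigma2 for wkj in row] for row in w]
    for x, y in data:
        scores = [sum(wkj * xj for wkj, xj in zip(w[k], x)) for k in range(num_labels)]
        z = sum(math.exp(s) for s in scores)
        for k in range(num_labels):
            p_k = math.exp(scores[k]) / z            # P_w(y' = k | x_d)
            for j, xj in enumerate(x):
                grad[k][j] += ((k == y) - p_k) * xj  # empirical - expected
    return grad
```

A finite-difference comparison against `log_posterior` is a cheap way to confirm that the gradient and the objective agree before handing them to an optimizer such as conjugate gradient.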

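1. The solution of maximum likelihood estimation with log-linear models is
actually also the solution to the dual problem of maximum entropy
estimation with constraints that empirical and expected feature counts must
be equal (Della Pietra et al. 1997).

Learning RMNs. The analysis for the relational setting is very similar.
Now, our data set D is actually a single instantiation $\mathcal{I}$, where the
same parameters are used multiple times, once for each different entity
that uses a feature. A particular choice of parameters $\mathbf{w}$ specifies a
particular RMN, which induces a probability distribution $P_\mathbf{w}$ over the
unrolled Markov network. The product of the likelihood of $\mathcal{I}$ and the
parameter prior defines our objective function, whose gradient
$\nabla L(\mathbf{w}, \mathcal{I})$ again consists of the empirical feature counts minus
the expected feature counts and a smoothing term due to the prior:

$$\mathbf{f}(\mathcal{I}.\mathbf{y}, \mathcal{I}.\mathbf{x}, \mathcal{I}.\mathbf{r}) - E_{P_\mathbf{w}}[\mathbf{f}(\mathcal{I}.\mathbf{Y}, \mathcal{I}.\mathbf{x}, \mathcal{I}.\mathbf{r})] - \frac{\mathbf{w}}{\sigma^2},$$

where the expectation $E_{P_\mathbf{w}}[\mathbf{f}(\mathcal{I}.\mathbf{Y}, \mathcal{I}.\mathbf{x}, \mathcal{I}.\mathbf{r})]$ is

$$\sum_{\mathcal{I}.\mathbf{y}'} \mathbf{f}(\mathcal{I}.\mathbf{y}', \mathcal{I}.\mathbf{x}, \mathcal{I}.\mathbf{r}) \, P_\mathbf{w}(\mathcal{I}.\mathbf{y}' \mid \mathcal{I}.\mathbf{x}, \mathcal{I}.\mathbf{r}).$$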

This last formula reveals a key difference between the
relational and the flat case: the sum over $\mathcal{I}.y'$ ranges over
the exponentially many assignments to all the label
attributes in the instantiation. In the flat case, the
probability decomposes as a product of probabilities for
individual data instances, so we can compute the expected feature
count for each instance separately. In the relational case,
these labels are correlated; indeed, capturing this correlation
was our main goal in defining this model. Hence, we need to
compute the expectation over the joint assignments to all
the entities together. Computing these expectations over an
exponentially large set is the expensive step in calculating
the gradient. It requires that we run inference on the
unrolled Markov network.
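To make the cost concrete, the sketch below computes an expected feature count by brute-force enumeration of all joint label assignments in a tiny network. It is deliberately simplified: there is a single hypothetical feature (the number of linked pairs that share a label) with one weight `w_agree`, no content variables, and a made-up four-page cycle. With k labels and n pages the loop visits k^n assignments, which is why this approach does not scale beyond toy examples.

```python
import itertools
import math

def expected_pair_agreement(n_nodes, edges, labels, w_agree):
    """Expected number of edges whose endpoints share a label, by brute force.

    Enumerates all len(labels) ** n_nodes joint assignments, which is only
    feasible for very small networks.
    """
    labels = list(labels)
    z = 0.0
    expected = 0.0
    n_assignments = 0
    for y in itertools.product(labels, repeat=n_nodes):
        agree = sum(1 for a, b in edges if y[a] == y[b])  # feature count f(y)
        weight = math.exp(w_agree * agree)                # unnormalized P_w(y)
        z += weight
        expected += weight * agree
        n_assignments += 1
    return expected / z, n_assignments

# Hypothetical 4-page cycle with 5 candidate labels per page.
edges = [(0, 1), (1, 2), (2, 3), (3, 0)]
exp_agree, n_assignments = expected_pair_agreement(4, edges, range(5), w_agree=0.0)
```

With five labels, four pages already require 625 terms; a network with tens of thousands of pages is far out of reach for exact enumeration.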

Inference in Markov Networks. The inference task in
our conditional Markov networks is to compute the posterior
distribution over the label variables in the instantiation
given the content variables. Exact algorithms for inference
in graphical models can execute this process efficiently for
specific graph topologies such as sequences, trees and other
low-treewidth graphs. However, the networks resulting
from domains such as our hypertext classification task are
very large (in our experiments, they contain tens of
thousands of nodes) and densely connected. Exact inference is
intractable in these cases.

We therefore resort to approximate inference. There is
a wide variety of approximation schemes for Markov
networks. We chose belief propagation for its
simplicity and relative efficiency and accuracy. Belief
propagation (BP) is a local message-passing algorithm
introduced by Pearl (1988). It is guaranteed to converge to
the correct marginal probabilities for each node only for
singly connected Markov networks. However, recent
analysis (Yedidia et al. 2000) provides some theoretical
justification for applying it to general networks, and
empirical results (Murphy et al. 1999) show that it
often converges in such networks and that, when it does, the
marginals are a good approximation to the correct
posteriors. As our results in Section 5 show, this approach works
well in our domain. We refer the reader to Yedidia et al.
(2000) for a detailed description of the BP algorithm.
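For readers who want a concrete picture of the message passing, here is a minimal sum-product sketch for a pairwise Markov network with synchronous updates. The function name, the dictionary-based potential layout and the three-node chain are assumptions made for this example; the chain is singly connected, so the resulting beliefs coincide with the exact marginals. On graphs with loops the same updates give only an approximation and may fail to converge.

```python
import math

def loopy_bp(n_nodes, n_states, node_pot, edge_pot, edges, n_iters=50):
    """Sum-product belief propagation on a pairwise Markov network.

    node_pot[i][s] and edge_pot[(i, j)][s][t] are nonnegative potentials.
    Returns one belief vector per node; beliefs are exact on trees.
    """
    neighbors = {i: [] for i in range(n_nodes)}
    for i, j in edges:
        neighbors[i].append(j)
        neighbors[j].append(i)

    def pot(i, j, s, t):
        # Look up the pairwise potential regardless of edge orientation.
        if (i, j) in edge_pot:
            return edge_pot[(i, j)][s][t]
        return edge_pot[(j, i)][t][s]

    # msgs[(i, j)][t]: message from node i to node j about state t of j.
    msgs = {}
    for i, j in edges:
        msgs[(i, j)] = [1.0 / n_states] * n_states
        msgs[(j, i)] = [1.0 / n_states] * n_states

    for _ in range(n_iters):
        new_msgs = {}
        for (i, j) in msgs:
            out = []
            for t in range(n_states):
                total = 0.0
                for s in range(n_states):
                    incoming = 1.0
                    for k in neighbors[i]:
                        if k != j:
                            incoming *= msgs[(k, i)][s]
                    total += node_pot[i][s] * pot(i, j, s, t) * incoming
                out.append(total)
            norm = sum(out)
            new_msgs[(i, j)] = [v / norm for v in out]  # normalize for stability
        msgs = new_msgs

    beliefs = []
    for i in range(n_nodes):
        b = [node_pot[i][s] * math.prod(msgs[(k, i)][s] for k in neighbors[i])
             for s in range(n_states)]
        norm = sum(b)
        beliefs.append([v / norm for v in b])
    return beliefs

# Hypothetical 3-node chain with binary labels and an "agreement" potential.
node_pot = [[2.0, 1.0], [1.0, 1.0], [1.0, 3.0]]
agree = [[2.0, 1.0], [1.0, 2.0]]
edges = [(0, 1), (1, 2)]
edge_pot = {(0, 1): agree, (1, 2): agree}
marginals = loopy_bp(3, 2, node_pot, edge_pot, edges)
```

A fixed iteration count is used here for simplicity; a practical implementation would normally also monitor the change in messages between iterations to detect non-convergence.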

5  Experiments 

We evaluated our framework on the WebKB dataset (Craven
et al. 1998), which is an instance of our hypertext
example. The data set contains webpages from four different
Computer Science departments: Cornell, Texas, Washington
and Wisconsin. Each page has a label attribute,
representing the type of webpage, which is one of course,
faculty, student, project or other. The data set is problematic
in that the category other is a grab-bag of pages of many
different types. The number of pages classified as other
is quite large, so that a baseline algorithm that simply
always selected other as the label would get an average
accuracy of 75%. We could restrict attention to just the pages
with the four remaining labels, but in a relational classification
setting, the deleted webpages might be useful in terms of
their interactions with other webpages. Hence, we
compromised by eliminating all pages labeled other that have fewer
than three outlinks, making the number of other pages commensurate
with the remaining categories.² For each page, we have access
to the entire HTML of the page and the links to other pages.
Our goal is to collectively classify webpages into one of
these five categories. In all of our experiments, we learn a
model from three schools and test the performance of the
learned model on the remaining school, thus evaluating the
generalization performance of the different models.
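The train-on-three, test-on-one protocol can be sketched as follows. The helper names and the `train_fn` / `error_fn` callables are hypothetical placeholders for whatever learner and error metric are being evaluated; only the four school names come from the text.

```python
SCHOOLS = ["Cornell", "Texas", "Washington", "Wisconsin"]

def leave_one_school_out(schools):
    """Yield (training_schools, test_school) pairs, one per held-out school."""
    for held_out in schools:
        train = [s for s in schools if s != held_out]
        yield train, held_out

def average_error(schools, train_fn, error_fn):
    """Average test error over the folds.

    train_fn(train_schools) returns a model and error_fn(model, test_school)
    returns an error rate in [0, 1]; both are supplied by the caller.
    """
    errors = [error_fn(train_fn(train), test)
              for train, test in leave_one_school_out(schools)]
    return sum(errors) / len(errors)

splits = list(leave_one_school_out(SCHOOLS))
```

Because no page appears in both the training and the test schools, this split measures how well a model transfers to an entirely unseen site.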

Unfortunately, we cannot directly compare our accuracy
results with previous work because different papers use
different subsets of the data and different training/test splits.
However, we compare to standard text classifiers such as
Naive Bayes, Logistic Regression, and Support Vector
Machines, which have been demonstrated to be successful on
this data set (Joachims 1999).

Flat Models. The simplest approach we tried predicts the
categories based on just the text content on the webpage.
The text of the webpage is represented using a set of
binary attributes that indicate the presence of different words
on the page. We found that stemming and feature selection
did not provide much benefit and simply pruned words that
appeared in fewer than three documents in each of the three
schools in the training data. We also experimented with
incorporating meta-data: words appearing in the title of the
page, in anchors of links to the page and in the last header
before a link to the page (Yang et al. 2002). Note that
meta-data, although mostly originating from pages linking into
the considered page, are easily incorporated as features,
i.e. the resulting classification task is still flat feature-based
classification. Our first experimental setup compares three
well-known text classifiers, Naive Bayes, linear support
vector machines³ (Svm), and logistic regression (Logistic),
using words and meta-words. The results, shown in
Fig. 2(a), show that the two discriminative approaches
outperform Naive Bayes. Logistic and Svm give very similar
results. The average error over the 4 schools was reduced
by around 4% by introducing the meta-data attributes.

² The resulting category distribution is: course (237), faculty
(148), other (332), research-project (82) and student (542). The
number of remaining pages for each school is: Cornell (280),
Texas (292), Washington (315) and Wisconsin (454). The number
of links for each school is: Cornell (574), Texas (574),
Washington (728) and Wisconsin (1614).

³ We trained a one-against-others Svm for each category and,
during testing, picked the category with the largest margin.

Figure 3: An illustration of the Section model.

Relational Models. Incorporating meta-data gives a
significant improvement, but we can take additional advantage
of the correlation in labels of related pages by classifying
them collectively. We want to capture these correlations in
our model and use them for transmitting information
between linked pages to provide more accurate classification.
We experimented with several relational models. Recall
that logistic regression is simply a flat conditional Markov
network. All of our relational Markov networks use a
logistic regression model locally for each page.

Our first model captures direct correlations between
labels of linked pages. These correlations are very common
in our data: courses and research projects almost never link
to each other; faculty rarely link to each other; students
have links to all categories but mostly courses. The Link
model, shown in Fig. 1, captures this correlation through
links: in addition to the local bag of words and meta-data
attributes, we introduce a relational clique template over
the labels of two pages that are linked.

A second relational model uses the insight that a webpage
often has internal structure that allows it to be broken
up into sections. For example, a faculty webpage might
have one section that discusses research, with a list of links
to all of the projects of the faculty member; a second
section might contain links to the courses taught by the faculty
member, and a third to his or her advisees. This pattern is
illustrated in Fig. 3. We can view a section of a webpage as a
fine-grained version of Kleinberg's hub (Kleinberg 1999)
(a page that contains many links to pages of a particular
category). Intuitively, if we have links to two pages in the
same section, they are likely to be on similar topics. To
take advantage of this trend, we need to enrich our schema
with a new relation Section, with attributes Key, Doc (the
document in which it appears), and Category. We also need to
add the attribute Section to Link to refer to the section it
appears in. In the RMN, we have two new relational clique
templates. The first contains the label of a section and the
label of the page it is on:

SELECT doc.Category, sec.Category
FROM Doc doc, Section sec
WHERE sec.Doc = doc.Key

Figure 2: (a) Comparison of Naive Bayes, Svm, and Logistic on WebKB, with and without meta-data features. (Only
averages over the 4 schools are shown here.) (b) Flat versus collective classification on WebKB: flat logistic regression
with meta-data, and three different relational models: Link, Section, and a combined Section+Link. (c) Comparison
of generative and discriminative relational models. Exists+Naive Bayes is completely generative. Exists+Logistic is
generative in the links, but locally discriminative in the page labels given the local features (words, meta-words). The Link
model is completely discriminative.

The second clique template involves the label of the section
containing the link and the label of the target page:

SELECT sec.Category, doc.Category
FROM Section sec, Link link, Doc doc
WHERE link.Sec = sec.Key AND link.To = doc.Key
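One way to read this query is as a recipe for unrolling the template: every row it returns becomes one clique in the unrolled Markov network. The Python sketch below mimics that join over in-memory rows. The dictionary-based row layout, the function name and the toy rows are assumptions made for illustration; only the attribute names Key, Sec and To are taken from the query.

```python
def section_target_cliques(sections, links, docs):
    """Instantiate the section-to-target template.

    Returns one (section key, target page key) pair per link that lies in
    a known section and points to a known page; each pair stands for a
    clique over sec.Category and doc.Category.
    """
    section_keys = {sec["Key"] for sec in sections}
    doc_keys = {doc["Key"] for doc in docs}
    return [(link["Sec"], link["To"])
            for link in links
            if link["Sec"] in section_keys and link["To"] in doc_keys]

# Hypothetical rows: one faculty page with a section linking to two pages.
docs = [{"Key": "d1"}, {"Key": "d2"}, {"Key": "d3"}]
sections = [{"Key": "s1", "Doc": "d1"}]
links = [
    {"Sec": "s1", "To": "d2"},
    {"Sec": "s1", "To": "d3"},
    {"Sec": None, "To": "d1"},   # a link outside any section yields no clique
]
cliques = section_target_cliques(sections, links, docs)
```

All cliques produced from the same template share one set of potential parameters, which is what lets a model trained on three schools be applied to a fourth.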

The original dataset did not contain section labels, so
we introduced them using the following simple procedure.
We defined a section as a sequence of three or more links
that have the same path to the root in the HTML parse tree. In
the training set, a section is labeled with the most frequent
category of its links. There is a sixth category, none,
assigned when the two most frequent categories of the links
are less than a factor of 2 apart. In the entire data set, the
breakdown of labels for the sections we found is: course
(40), faculty (24), other (187), research-project (11),
student (71) and none (17). Note that these labels are hidden
in the test data, so the learning algorithm now also has to
learn to predict section labels. Although not our final aim,
correct prediction of section labels is very helpful. Words
appearing in the last header before the section are used to
better predict the section label by introducing a clique over
these words and section labels.
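The labeling rule for training sections can be sketched in a few lines of Python. The function name and the `ratio` parameter are hypothetical, and two details are assumptions of this sketch rather than statements from the text: a count exactly twice the runner-up is treated as sufficient, and a section whose links all share one category simply receives that category.

```python
from collections import Counter

def label_section(link_categories, ratio=2.0):
    """Label a section from the categories of the pages it links to.

    Returns the most frequent category, or "none" when the two most
    frequent categories are less than a factor of `ratio` apart.
    """
    if not link_categories:
        raise ValueError("a section must contain at least one link")
    counts = Counter(link_categories).most_common(2)
    if len(counts) == 1:
        return counts[0][0]                  # only one category present
    (top, n_top), (_, n_second) = counts
    return top if n_top >= ratio * n_second else "none"

example = label_section(["course", "course", "course", "faculty"])
```

In the test data these labels are not computed this way, since the page categories are unknown there; they are predicted jointly with the page labels.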

We compared the performance of Link, Section and
Section+Link (a combined model which uses both types of
cliques) on the task of predicting webpage labels, relative to
the baseline of flat logistic regression with meta-data. Our
experiments used MAP estimation with a Gaussian prior on
the feature weights with standard deviation of 0.3. Fig. 2(b)
compares the average error achieved by the different
models on the four schools, training on three and testing on the
fourth. We see that incorporating any type of relational
information consistently gives significant improvement over
the baseline model. The Link model incorporates more
relational interactions, but each is a weaker indicator. The
Section model ignores links outside of coherent sections,
but each of the links it includes is a very strong
indicator. In general, we see that the Section model performs
slightly better. The joint model is able to combine
benefits from both and generally outperforms all of the other
models. The only exception is for the task of classifying
the Wisconsin data. In this case, the joint Section+Link
model contains many links, as well as some large tightly
connected loops, so belief propagation did not converge
for a subset of nodes. Hence, the results of the inference,
which was stopped at a fixed arbitrary number of iterations,
were highly variable and resulted in lower accuracy.

Discriminative vs Generative. Our last experiment
illustrates the benefits of discriminative training in
relational classification. We compared three models. The
Exists+Naive Bayes model is a completely generative model
proposed by Getoor et al. (2001). At each page, a naive
Bayes model generates the words on a page given the page
label. A separate generative model specifies a probability
over the existence of links between pages conditioned on
both pages’ labels. We can also consider an alternative
Exists+Logistic model that uses a discriminative model for
the connection between page label and words, i.e. uses
logistic regression for the conditional probability
distribution of page label given words. This model has
equivalent expressive power to the naive Bayes model but is
discriminatively rather than generatively trained. Finally, the
Link model is the fully discriminative (undirected) variant we
presented earlier, which uses a discriminative model
for the label given both words and link existence. The
results, shown in Fig. 2(c), show that discriminative training
provides a significant improvement in accuracy: the Link
model outperforms Exists+Logistic, which in turn
outperforms Exists+Naive Bayes.

As illustrated in Table 1, the gain in accuracy comes at
some cost in training time: for the generative models,
parameter estimation is closed form, while the discriminative
models are trained using conjugate gradient, where each
iteration requires inference over the unrolled RMN. On the
other hand, both types of models require inference when
the model is used on new data; the generative model
constructs a much larger, fully-connected network, resulting
in significantly longer testing times. We also note that the
situation changes if some of the data is unobserved in the
training set. In this case, generative training also requires an
iterative procedure (such as EM) where each iteration uses
the significantly more expensive inference.

            Links    Links+Section    Exists+NB
Training    1530     6060             1
Testing     7        10               100

Table 1: Average train/test running times (seconds). All
runs were done on a 700 MHz Pentium III. Training times
are averaged over four runs on three schools each. Testing
times are averaged over four runs on one school each.

6  Discussion  and  Conclusions 

In this paper, we propose a new approach for
classification in relational domains. Our approach provides a
coherent probabilistic foundation for the process of collective
classification, where we want to classify multiple entities,
exploiting the interactions between their labels. We have
shown that we can exploit a very rich set of relational
patterns in classification, significantly improving the
classification accuracy over standard flat classification.

In some cases, we can incorporate relational features
into standard flat classification. For example, when
classifying papers into topics, it is possible to simply view the
presence of particular citations as atomic features.
However, this approach is limited in cases where some or even
all of the relational features that occur in the test data are
not observed in the training data. In our WebKB example,
there is no overlap between the webpages in the different
schools, so we cannot learn anything from the training data
about the significance of a hyperlink to/from a particular
webpage in the test data. Incorporating basic features (e.g.,
words) from the related entities can aid in classification, but
cannot exploit the strong correlation between the labels of
related entities that RMNs capture.

Our results in this paper are only a first step towards
understanding the power of relational classification. On the
technical side, we can gain significant power from
introducing hidden variables (that are not observed even in the
training data), such as the degree to which a webpage is an
authority (Kleinberg 1999). Furthermore, as we discussed,
there are many other types of relational patterns that we can
exploit. We can also naturally extend the proposed models
to predict relations between entities, for example,
advisor-advisee, instructor-course or project-member.

Hypertext is the most easily available source of
structured data; however, RMNs are generally applicable to any
relational domain. In particular, social networks provide
extensive information about interactions among people and
organizations. RMNs offer a principled method for
learning to predict communities of and hierarchical structure
between people and organizations based on both the local
attributes and the patterns of static and dynamic interaction.
Given the wealth of possible patterns, it is particularly
interesting to explore the problem of inducing them
automatically. We intend to explore this topic in future work.